Scoring

Scoring in Vally answers one question: how good is this skill? The system gives you several metrics, each telling you something different. This page explains what they mean and how to use them.

Every grader returns a score between 0 and 1 and a passed boolean. When a stimulus has multiple graders, their scores are combined into an aggregate score:

eval.yaml
scoring:
  weights:
    file-exists: 1.0       # this check matters most
    output-contains: 0.5   # this check matters less
  threshold: 0.7           # need ≥ 70% to pass

If file-exists scores 1.0 and output-contains scores 0.0, the intended aggregate would be:

(1.0 × 1.0 + 0.5 × 0.0) / (1.0 + 0.5) = 0.67 → below 0.7 → FAIL

If you omit weights, all graders are weighted equally. If no threshold is configured in eval.yaml or on the CLI, vally eval uses the binary grader pass/fail verdict.
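The weighted average above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Vally's implementation; the function name `aggregate_score` and the dictionary layout are assumptions made for the example.

```python
def aggregate_score(scores, weights=None):
    """Weighted mean of grader scores; equal weights when none are given."""
    if weights is None:
        # Assumption: omitting weights means every grader counts equally.
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_sum = sum(weights[name] * score for name, score in scores.items())
    return weighted_sum / total_weight


scores = {"file-exists": 1.0, "output-contains": 0.0}
weights = {"file-exists": 1.0, "output-contains": 0.5}
threshold = 0.7

aggregate = aggregate_score(scores, weights)
passed = aggregate >= threshold
print(f"{aggregate:.2f} -> {'PASS' if passed else 'FAIL'}")  # 0.67 -> FAIL
```

With equal weights the same two scores would average to 0.5, which also falls below the 0.7 threshold.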

LLM-based agents are non-deterministic. The same prompt can succeed on one run and fail the next. A single run tells you what happened that time — not what will happen in general.

Running multiple trials (e.g., config.runs: 5) lets you measure how often the agent succeeds, which is far more useful than a single pass/fail.

Pass rate — “how often does it work?”

The simplest metric. Out of K trials, how many passed?

3 out of 5 trials passed → pass rate = 60%

This is your starting point. But pass rate alone can be misleading when K is small (which it always is in evals — you’re not running 1,000 trials).

pass@k answers: if I give the agent K attempts, what’s the chance at least one succeeds?

This measures capability. A skill with a 30% pass rate might seem bad, but its pass@5 is 83%: given 5 attempts, you will get at least one good result about five times out of six.

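| Pass rate | pass@1 | pass@3 | pass@5 |
|-----------|--------|--------|--------|
| 90%       | 90%    | 99.9%  | ~100%  |
| 50%       | 50%    | 87.5%  | 96.9%  |
| 30%       | 30%    | 65.7%  | 83.2%  |
| 10%       | 10%    | 27.1%  | 41.0%  |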

When to use pass@k: Evaluating whether a skill has the fundamental capability to solve a problem. Good for early development and capability benchmarking.
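The pass@k table values are consistent with the closed form 1 - (1 - p)^k, which treats the k attempts as independent trials with a fixed pass rate p. A minimal sketch under that assumption follows; the function name `pass_at_k` is illustrative, and Vally's own estimator may differ.

```python
def pass_at_k(pass_rate: float, k: int) -> float:
    """Chance that at least one of k independent trials passes."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    # All k trials fail with probability (1 - p)^k; pass@k is the complement.
    return 1.0 - (1.0 - pass_rate) ** k


# Print one row per pass rate for k = 1, 3, 5.
for rate in (0.9, 0.5, 0.3, 0.1):
    row = [pass_at_k(rate, k) for k in (1, 3, 5)]
    print(f"{rate:.0%}:", "  ".join(f"{value:.1%}" for value in row))
```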

pass^k answers: if I run the agent K times, what’s the chance every single run succeeds?

This measures reliability. A skill with 80% pass rate sounds solid, but pass^5 is only 33% — meaning in 5 runs, there’s a 2-in-3 chance at least one will fail.

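| Pass rate | pass^1 | pass^3 | pass^5 |
|-----------|--------|--------|--------|
| 95%       | 95%    | 85.7%  | 77.4%  |
| 90%       | 90%    | 72.9%  | 59.0%  |
| 80%       | 80%    | 51.2%  | 32.8%  |
| 70%       | 70%    | 34.3%  | 16.8%  |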

When to use pass^k: Deciding whether a skill is reliable enough for production or CI gating. If pass^k is low, the skill works but isn’t dependable.
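The pass^k table values are consistent with p^k, the probability that k independent trials with pass rate p all succeed. A minimal sketch under that independence assumption follows; the function name `pass_hat_k` is illustrative, not a Vally API.

```python
def pass_hat_k(pass_rate: float, k: int) -> float:
    """Chance that every one of k independent trials passes."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    return pass_rate ** k


# An 80% pass rate looks solid, but five clean runs in a row are unlikely.
all_five_pass = pass_hat_k(0.8, 5)
print(f"pass^5 = {all_five_pass:.1%}, at least one failure = {1 - all_five_pass:.1%}")
```

Because the value decays geometrically with k, even small drops in pass rate show up clearly at k = 5.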

When some trials pass and others fail, the stimulus is flaky. Vally flags this explicitly:

━━━ basic-test-generation (5 trials) ━━━
Trial 1 ✔ Trial 2 ✔ Trial 3 ✘ Trial 4 ✔ Trial 5 ✔
pass rate: 4/5 (80%) pass@5: ~100% pass^5: 32.8%
⚠ flaky (20.0% minority outcomes)

The flakinessPercent tells you what fraction of outcomes were in the minority. Here, 1 out of 5 failed, so flakiness is 20%.

How to interpret flakiness:

  • 0% → perfectly consistent (all pass or all fail)
  • < 20% → mostly stable, occasional failures — investigate the failing cases
  • 20–40% → unreliable — the skill needs work before CI gating
  • > 40% → nearly random — the skill or the eval likely has a fundamental problem
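The flakiness figure and the bands above can be sketched as follows. This assumes flakiness is the share of trials in the minority outcome, as described for flakinessPercent; the function names and band labels are illustrative and not part of Vally.

```python
from typing import Sequence


def flakiness_percent(outcomes: Sequence[bool]) -> float:
    """Percentage of trials that landed in the minority outcome."""
    if not outcomes:
        raise ValueError("need at least one trial")
    passes = sum(1 for outcome in outcomes if outcome)
    fails = len(outcomes) - passes
    return 100.0 * min(passes, fails) / len(outcomes)


def interpret(flakiness: float) -> str:
    """Map a flakiness percentage onto the bands listed above."""
    if flakiness == 0:
        return "perfectly consistent"
    if flakiness < 20:
        return "mostly stable"
    if flakiness <= 40:
        return "unreliable"
    return "nearly random"


trials = [True, True, False, True, True]  # trial 3 failed
value = flakiness_percent(trials)
print(f"{value:.1f}% -> {interpret(value)}")  # 20.0% -> unreliable
```

Under this definition the value can never exceed 50%, since the minority outcome is at most half of the trials.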

Putting it together: reading a score report

Here’s a real score report and what each number tells you:

SkillScore: test-writer
├── overallScore: 0.85            ← average across stimuli
├── passed: true                  ← overall score meets the pass threshold
├── basic-test-generation
│   ├── aggregateScore: 0.95      ← this stimulus scores well
│   ├── pass@3: 100%              ← agent can definitely do this
│   ├── pass^3: 85.7%             ← and does it reliably
│   └── flaky: false              ← consistent results
└── edge-case-handling
    ├── aggregateScore: 0.75      ← this stimulus is weaker
    ├── pass@3: 97.5%             ← agent CAN do it
    ├── pass^3: 42.2%             ← but not reliably
    └── flaky: true (33.3%)       ← 1 in 3 trials fails

What to do with this: The agent has the capability for both stimuli (high pass@k), but edge-case handling is unreliable (low pass^k, flaky). Focus improvement efforts there.
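The same triage can be expressed as a small filter: find stimuli that show capability but not reliability. This is a sketch only. The dictionary layout, the key names, and the cutoffs `capable_at` and `reliable_at` are hypothetical choices for illustration, not a Vally report format or Vally defaults.

```python
# Metrics copied from the report above, expressed as fractions.
report = {
    "basic-test-generation": {"pass_at_k": 1.000, "pass_hat_k": 0.857, "flaky": False},
    "edge-case-handling": {"pass_at_k": 0.975, "pass_hat_k": 0.422, "flaky": True},
}


def needs_reliability_work(metrics, capable_at=0.9, reliable_at=0.7):
    """True when a stimulus is capable (high pass@k) but not dependable."""
    capable = metrics["pass_at_k"] >= capable_at
    reliable = metrics["pass_hat_k"] >= reliable_at and not metrics["flaky"]
    return capable and not reliable


to_fix = [name for name, metrics in report.items() if needs_reliability_work(metrics)]
print(to_fix)  # ['edge-case-handling']
```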

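| Context              | Recommended runs | Why                                              |
|----------------------|------------------|--------------------------------------------------|
| Inner loop (lint)    | 0                | No execution, static checks only                 |
| CI gate              | 3                | Enough to detect flakiness without slowing PRs   |
| Outer loop / nightly | 5–10             | More samples for accurate pass@k with LLM judges |

| Goal                        | Recommended threshold | Metric to watch                     |
|-----------------------------|-----------------------|-------------------------------------|
| “Does this work at all?”    | 0.1                   | pass@k: is the capability there?    |
| “Is this reliable for CI?”  | 0.7                   | pass rate: does it usually work?    |
| “Is this production-ready?” | 0.9                   | pass^k: does it always work?        |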