Scoring
Scoring in Vally answers one question: how good is this skill? The system gives you several metrics, each telling you something different. This page explains what they mean and how to use them.
The basics: graders produce scores
Every grader returns a score between 0 and 1 and a passed boolean. When a stimulus has multiple graders, their scores are combined into an aggregate score:

    scoring:
      weights:
        file-exists: 1.0        # this check matters most
        output-contains: 0.5    # this check matters less
      threshold: 0.7            # need ≥ 70% to pass

If file-exists scores 1.0 and output-contains scores 0.0, the intended aggregate would be:

    (1.0 × 1.0 + 0.5 × 0.0) / (1.0 + 0.5) = 0.67 → below 0.7 → FAIL

If you omit weights, all graders are weighted equally. If no threshold is configured in eval.yaml or on the CLI, vally eval uses the binary grader pass/fail verdict.
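The weighted average above can be reproduced in a few lines of Python. This is a minimal sketch of the arithmetic, not Vally's implementation: the function name `aggregate_score` is hypothetical, and treating a grader with no listed weight as weight 1.0 is an assumption made for illustration.

```python
from typing import Optional


def aggregate_score(scores: dict[str, float],
                    weights: Optional[dict[str, float]] = None) -> float:
    """Weighted mean of grader scores.

    Assumption: a grader without an explicit weight counts as 1.0,
    so omitting weights entirely gives a plain average.
    """
    weights = weights or {}
    total_weight = sum(weights.get(name, 1.0) for name in scores)
    if total_weight == 0:
        raise ValueError("at least one grader must have a non-zero weight")
    weighted_sum = sum(weights.get(name, 1.0) * score
                       for name, score in scores.items())
    return weighted_sum / total_weight


scores = {"file-exists": 1.0, "output-contains": 0.0}
weights = {"file-exists": 1.0, "output-contains": 0.5}

score = aggregate_score(scores, weights)
passed = score >= 0.7  # threshold from the example config
print(f"{score:.2f}", "PASS" if passed else "FAIL")  # 0.67 FAIL
```

With no weights, the same two scores average to 0.5, which also falls below the 0.7 threshold.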
Why run multiple trials?
LLM-based agents are non-deterministic. The same prompt can succeed on one run and fail the next. A single run tells you what happened that time, not what will happen in general.
Running multiple trials (e.g., config.runs: 5) lets you measure how often the agent succeeds, which is far more useful than a single pass/fail.
The three metrics that matter
Pass rate — “how often does it work?”
The simplest metric. Out of K trials, how many passed?

    3 out of 5 trials passed → pass rate = 60%

This is your starting point. But pass rate alone can be misleading when K is small, which it almost always is in evals: you’re not running 1,000 trials.
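One way to see how little a small K pins down is to put a confidence interval around the observed pass rate. The sketch below uses the Wilson score interval, a standard statistical technique. It is offered as an illustration only; nothing on this page says Vally reports such an interval, and the function name is hypothetical.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate (z = 1.96)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(3, 5)
print(f"3/5 passed: observed 60%, plausible range {low:.0%} to {high:.0%}")
```

For 3 passes out of 5, the interval spans roughly 23% to 88%, so the true pass rate could be poor or quite good.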
pass@k — “can it do this at all?”
pass@k answers: if I give the agent K attempts, what’s the chance at least one succeeds?
This measures capability. A skill with a 30% pass rate might seem bad, but its pass@5 is 83%: given 5 attempts, you will get at least one good result roughly five times out of six.
| Pass rate | pass@1 | pass@3 | pass@5 |
|---|---|---|---|
| 90% | 90% | 99.9% | ~100% |
| 50% | 50% | 87.5% | 96.9% |
| 30% | 30% | 65.7% | 83.2% |
| 10% | 10% | 27.1% | 41.0% |
When to use pass@k: Evaluating whether a skill has the fundamental capability to solve a problem. Good for early development and capability benchmarking.
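The table values are consistent with treating each trial as an independent attempt with a fixed success probability, so that pass@k = 1 − (1 − pass rate)^k. The sketch below recomputes the rows under that assumption. The function name is hypothetical, and the Scoring functions reference remains the authority on the exact formula Vally uses.

```python
def pass_at_k(pass_rate: float, k: int) -> float:
    """Chance that at least one of k independent attempts passes."""
    return 1.0 - (1.0 - pass_rate) ** k


# Recompute the table rows for k = 1, 3, 5.
for rate in (0.9, 0.5, 0.3, 0.1):
    row = "  ".join(f"pass@{k}={pass_at_k(rate, k):.1%}" for k in (1, 3, 5))
    print(f"rate={rate:.0%}  {row}")
```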
pass^k — “can I rely on this?”
pass^k answers: if I run the agent K times, what’s the chance every single run succeeds?
This measures reliability. A skill with 80% pass rate sounds solid, but pass^5 is only 33% — meaning in 5 runs, there’s a 2-in-3 chance at least one will fail.
| Pass rate | pass^1 | pass^3 | pass^5 |
|---|---|---|---|
| 95% | 95% | 85.7% | 77.4% |
| 90% | 90% | 72.9% | 59.0% |
| 80% | 80% | 51.2% | 32.8% |
| 70% | 70% | 34.3% | 16.8% |
When to use pass^k: Deciding whether a skill is reliable enough for production or CI gating. If pass^k is low, the skill works but isn’t dependable.
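Under the same independence assumption, the table values match pass^k = (pass rate)^k. The sketch below computes it and also inverts it, which shows how high the per-run pass rate must be to hit a reliability target. Both function names are hypothetical, and the formula is inferred from the table rather than taken from Vally's source.

```python
def pass_hat_k(pass_rate: float, k: int) -> float:
    """Chance that all k independent runs pass."""
    return pass_rate ** k


def required_pass_rate(target: float, k: int) -> float:
    """Per-run pass rate needed for pass^k to reach the target."""
    return target ** (1.0 / k)


print(f"pass^5 at 80%: {pass_hat_k(0.8, 5):.1%}")  # 32.8%
print(f"pass rate needed for pass^5 >= 90%: {required_pass_rate(0.9, 5):.1%}")
```

The second line makes the cost of reliability concrete: for pass^5 to reach 90%, each run has to pass about 98% of the time.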
Flakiness — “is this consistent?”
When some trials pass and others fail, the stimulus is flaky. Vally flags this explicitly:

    ━━━ basic-test-generation (5 trials) ━━━
      Trial 1 ✔
      Trial 2 ✔
      Trial 3 ✘
      Trial 4 ✔
      Trial 5 ✔

      pass rate: 4/5 (80%)
      pass@5: 100.0%
      pass^5: 32.8%
      ⚠ flaky (20.0% minority outcomes)

The flakinessPercent value tells you what fraction of outcomes were in the minority. Here, 1 out of 5 failed, so flakiness is 20%.
How to interpret flakiness:
- 0% → perfectly consistent (all pass or all fail)
- < 20% → mostly stable, occasional failures — investigate the failing cases
- 20–40% → unreliable — the skill needs work before CI gating
- > 40% → nearly random — the skill or the eval likely has a fundamental problem
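The minority-outcome definition and the interpretation bands above can be sketched as follows. The function names are hypothetical and the code is an illustration of the definition as this page describes it, not Vally's implementation.

```python
def flakiness_percent(outcomes: list[bool]) -> float:
    """Percentage of trials whose outcome was in the minority."""
    if not outcomes:
        raise ValueError("need at least one trial")
    passes = sum(outcomes)
    minority = min(passes, len(outcomes) - passes)
    return 100.0 * minority / len(outcomes)


def interpret(flakiness: float) -> str:
    """Map a flakiness percentage onto the bands listed above."""
    if flakiness == 0:
        return "perfectly consistent"
    if flakiness < 20:
        return "mostly stable"
    if flakiness <= 40:
        return "unreliable"
    return "nearly random"


trials = [True, True, False, True, True]  # the 5-trial example above
value = flakiness_percent(trials)
print(f"{value:.1f}% -> {interpret(value)}")  # 20.0% -> unreliable
```

Because the minority can never exceed half the trials, the value tops out at 50%. A stimulus that fails every trial scores 0% flakiness, so read this number alongside the pass rate.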
Putting it together: reading a score report
Here’s a real score report and what each number tells you:

    SkillScore: test-writer
    ├── overallScore: 0.85           ← average across stimuli
    ├── passed: true                 ← overall score meets the pass threshold
    │
    ├── basic-test-generation
    │   ├── aggregateScore: 0.95     ← this stimulus scores well
    │   ├── pass@3: 100%             ← agent can definitely do this
    │   ├── pass^3: 85.7%            ← and does it reliably
    │   └── flaky: false             ← consistent results
    │
    └── edge-case-handling
        ├── aggregateScore: 0.75     ← this stimulus is weaker
        ├── pass@3: 97.5%            ← agent CAN do it
        ├── pass^3: 42.2%            ← but not reliably
        └── flaky: true (33.3%)      ← 1 in 3 trials fails

What to do with this: The agent has the capability for both stimuli (high pass@k), but edge-case handling is unreliable (low pass^k, flaky). Focus improvement efforts there.
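The same reading can be done programmatically. The sketch below copies the report's numbers into a plain dictionary by hand, recomputes overallScore as the mean of the per-stimulus aggregate scores, and flags stimuli that are capable but unreliable. The dictionary layout and the 0.9 and 0.5 cutoffs are assumptions chosen for illustration, not a Vally output format or built-in rule.

```python
from statistics import mean

# Values copied by hand from the report above (hypothetical structure).
report = {
    "basic-test-generation": {
        "aggregateScore": 0.95, "pass_at_k": 1.000, "pass_hat_k": 0.857, "flaky": False,
    },
    "edge-case-handling": {
        "aggregateScore": 0.75, "pass_at_k": 0.975, "pass_hat_k": 0.422, "flaky": True,
    },
}

# overallScore is described as the average across stimuli.
overall = mean(s["aggregateScore"] for s in report.values())
print(f"overallScore: {overall:.2f}")  # overallScore: 0.85

# Capable (high pass@k) but unreliable (low pass^k); cutoffs are illustrative.
needs_work = [
    name for name, s in report.items()
    if s["pass_at_k"] >= 0.9 and s["pass_hat_k"] < 0.5
]
print("focus on:", needs_work)  # focus on: ['edge-case-handling']
```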
Choosing runs and thresholds
How many runs?
| Context | Recommended runs | Why |
|---|---|---|
| Inner loop (lint) | 0 | No execution — static checks only |
| CI gate | 3 | Enough to detect flakiness without slowing PRs |
| Outer loop / nightly | 5–10 | More samples for accurate pass@k with LLM judges |
What threshold?
| Goal | Recommended threshold | Metric to watch |
|---|---|---|
| “Does this work at all?” | 0.1 | pass@k: is the capability there? |
| “Is this reliable for CI?” | 0.7 | pass rate: does it usually work? |
| “Is this production-ready?” | 0.9 | pass^k: does it always work? |
Next steps
- How it works — where scoring fits in the pipeline
- Writing eval specs — configure scoring in practice
- Scoring functions reference — implementation details and formulas