Scoring

Scoring in Vally answers one question: how good is this skill? The system gives you several metrics, each telling you something different. This page explains what they mean and how to use them.

The basics: graders produce scores

Every grader returns a score between 0 and 1 and a passed boolean. When a stimulus has multiple graders, their scores are combined into an aggregate score:

scoring:
  weights:
    file-exists: 0.7 # this check matters most
    output-contains: 0.3 # this check matters less
  threshold: 0.7 # need ≥ 70% to pass

Weights must form a normalized distribution — they must sum to 1.0 (±0.01 tolerance for two-decimal-place rounding). The vally lint command reports a scoring-weight-sum error when they don’t.
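To make the tolerance concrete, here's a minimal sketch of the kind of check the lint rule describes. The function name weights_sum_ok is hypothetical and this is not Vally's implementation; it only mirrors the rule stated above.

```python
def weights_sum_ok(weights: dict[str, float], tolerance: float = 0.01) -> bool:
    """Return True when the weights sum to 1.0 within the tolerance.

    Hypothetical helper, not part of Vally.
    """
    # The tiny epsilon guards against floating-point noise at the boundary.
    return abs(sum(weights.values()) - 1.0) <= tolerance + 1e-9


print(weights_sum_ok({"file-exists": 0.7, "output-contains": 0.3}))  # True
print(weights_sum_ok({"file-exists": 0.7, "output-contains": 0.2}))  # False
```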

If file-exists scores 1.0 and output-contains scores 0.0, the aggregate is the weighted sum:

0.7 × 1.0 + 0.3 × 0.0 = 0.70 → meets the 0.7 threshold → PASS

When a stimulus has multiple graders of the same type (e.g., two output-contains checks), their scores are averaged first, then the type’s weight is applied once:

# Two output-contains graders: one passes (1.0), one fails (0.0)
# output-contains avg = (1.0 + 0.0) / 2 = 0.5
0.3 × 0.5 + 0.7 × 1.0 = 0.85

Graders absent from weights receive an implicit weight of 0 and do not contribute to the score. If you omit weights entirely, all graders are averaged equally. If no threshold is configured in eval.yaml or on the CLI, vally eval uses the binary grader pass/fail verdict.
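The rules above can be sketched in a few lines. This is an illustration only, not Vally's code: the function name aggregate_score and the (grader type, score) input shape are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def aggregate_score(results, weights=None):
    """Combine (grader_type, score) pairs into one aggregate score.

    Illustrative sketch of the documented rules; names are hypothetical.
    """
    if not weights:
        # No weights configured: every grader counts equally.
        return mean(score for _, score in results)

    # Average graders of the same type first.
    by_type = defaultdict(list)
    for grader_type, score in results:
        by_type[grader_type].append(score)

    # Apply each type's weight once; types missing from weights get 0.
    return sum(weights.get(t, 0.0) * mean(s) for t, s in by_type.items())


weights = {"file-exists": 0.7, "output-contains": 0.3}
results = [
    ("file-exists", 1.0),
    ("output-contains", 1.0),
    ("output-contains", 0.0),
]
score = aggregate_score(results, weights)
print(f"{score:.2f}")   # 0.85
print(score >= 0.7)     # True
```

Passing weights=None to the same function returns the plain mean of all grader scores, which is how the sketch models the "averaged equally" case.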

Why run multiple trials?

LLM-based agents are non-deterministic. The same prompt can succeed on one run and fail the next. A single run tells you what happened that time — not what will happen in general.

Running multiple trials (e.g., defaults.runs: 5) lets you measure how often the agent succeeds, which is far more useful than a single pass/fail.

The three metrics that matter

Pass rate — “how often does it work?”

The simplest metric: out of K trials, what fraction passed?

3 out of 5 trials passed → pass rate = 60%

This is your starting point. But pass rate alone can be misleading when K is small, and in evals it always is: you're not running 1,000 trials. With 5 trials, a single outcome moves the pass rate by 20 percentage points.

pass@k — “can it do this at all?”

pass@k answers: if I give the agent k attempts, what's the chance at least one succeeds?

This measures capability. A skill with a 30% pass rate might seem bad, but its pass@5 is 83%: across 5 attempts, there is roughly a 5-in-6 chance that at least one succeeds.

Pass rate	pass@1	pass@3	pass@5
90%	90%	99.9%	~100%
50%	50%	87.5%	96.9%
30%	30%	65.7%	83.2%
10%	10%	27.1%	41.0%

When to use pass@k: Evaluating whether a skill has the fundamental capability to solve a problem. Good for early development and capability benchmarking.
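The figures in the table are consistent with treating each trial as an independent attempt that succeeds with probability equal to the pass rate p, which gives pass@k = 1 − (1 − p)^k. Here's a minimal sketch under that assumption. The function name pass_at_k is illustrative, and Vally's own estimator (see the scoring functions reference) may be computed differently.

```python
def pass_at_k(pass_rate: float, k: int) -> float:
    """Chance that at least one of k independent attempts succeeds.

    Assumes every attempt succeeds with the same probability (pass_rate).
    """
    return 1.0 - (1.0 - pass_rate) ** k


# Print one row per pass rate, with pass@1, pass@3, and pass@5.
for p in (0.9, 0.5, 0.3, 0.1):
    row = "  ".join(f"pass@{k}: {pass_at_k(p, k):6.1%}" for k in (1, 3, 5))
    print(f"pass rate {p:.0%}  {row}")
```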

pass^k — “can I rely on this?”

pass^k answers: if I run the agent k times, what's the chance every single run succeeds?

This measures reliability. A skill with an 80% pass rate sounds solid, but its pass^5 is only 33%: in 5 runs, there's a 2-in-3 chance that at least one will fail.

Pass rate	pass^1	pass^3	pass^5
95%	95%	85.7%	77.4%
90%	90%	72.9%	59.0%
80%	80%	51.2%	32.8%
70%	70%	34.3%	16.8%

When to use pass^k: Deciding whether a skill is reliable enough for production or CI gating. If pass^k is low, the skill works but isn’t dependable.
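The pass^k figures in the table match the same independence assumption: if each run succeeds with probability p, all k runs succeed with probability p^k. A small sketch follows. The name pass_hat_k is made up for the example and is not a Vally API.

```python
def pass_hat_k(pass_rate: float, k: int) -> float:
    """Chance that all k independent runs succeed.

    Assumes every run succeeds with the same probability (pass_rate).
    """
    return pass_rate ** k


# Reliability drops quickly as k grows, even for a high pass rate.
for p in (0.95, 0.9, 0.8, 0.7):
    row = "  ".join(f"pass^{k}: {pass_hat_k(p, k):6.1%}" for k in (1, 3, 5))
    print(f"pass rate {p:.0%}  {row}")
```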

Flakiness — “is this consistent?”

When some trials pass and others fail, the stimulus is flaky. Vally flags this explicitly:

━━━ basic-test-generation (5 trials) ━━━
  Trial 1  ✔    Trial 2  ✔    Trial 3  ✘    Trial 4  ✔    Trial 5  ✔

  pass rate: 4/5 (80%)    pass@5: 100.0%    pass^5: 32.8%
  ⚠ flaky (20.0% minority outcomes)

flakinessPercent is the share of trial outcomes that fall in the minority. Here, 1 out of 5 trials failed, so flakiness is 20%.
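As a rough sketch, the minority share can be computed from a list of per-trial pass/fail outcomes. The function name flakiness_percent and the list-of-booleans input are assumptions for illustration, not Vally's internals.

```python
def flakiness_percent(outcomes: list[bool]) -> float:
    """Percentage of trial outcomes that fall in the minority."""
    if not outcomes:
        raise ValueError("need at least one trial outcome")
    passes = sum(1 for passed in outcomes if passed)
    # The minority is whichever of passes or failures occurred less often.
    minority = min(passes, len(outcomes) - passes)
    return 100.0 * minority / len(outcomes)


trials = [True, True, False, True, True]
print(f"{flakiness_percent(trials):.1f}%")   # 20.0%
print(flakiness_percent([True] * 5))         # 0.0
```

Under this definition the value can never exceed 50%, because the minority outcome is at most half of the trials.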

How to interpret flakiness:

0% → perfectly consistent (all pass or all fail)
< 20% → mostly stable, occasional failures — investigate the failing cases
20–40% → unreliable — the skill needs work before CI gating
> 40% → nearly random — the skill or the eval likely has a fundamental problem

Putting it together: reading a score report

Here’s a real score report and what each number tells you:

SkillScore: test-writer
├── overallScore: 0.85           ← average across stimuli
├── passed: true                 ← overall score meets the pass threshold
│
├── basic-test-generation
│   ├── aggregateScore: 0.95     ← this stimulus scores well
│   ├── pass@3: 100%             ← agent can definitely do this
│   ├── pass^3: 85.7%            ← and does it reliably
│   └── flaky: false             ← consistent results
│
└── edge-case-handling
    ├── aggregateScore: 0.75     ← this stimulus is weaker
    ├── pass@3: 98.4%            ← agent CAN do it
    ├── pass^3: 42.2%            ← but not reliably
    └── flaky: true (33.3%)      ← 1 in 3 trials fails

What to do with this: The agent has the capability for both stimuli (high pass@k), but edge-case handling is unreliable (low pass^k, flaky). Focus improvement efforts there.
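The same triage can be sketched in code. The dictionary below copies values from the report above; its shape, the needs_work list, and the 0.8 reliability cut-off are all assumptions chosen for the example, not Vally's output format or defaults.

```python
from statistics import mean

# Values copied from the report above; the structure is hypothetical.
report = {
    "basic-test-generation": {
        "aggregateScore": 0.95, "pass_hat_k": 0.857, "flaky": False,
    },
    "edge-case-handling": {
        "aggregateScore": 0.75, "pass_hat_k": 0.422, "flaky": True,
    },
}

# overallScore is described as the average across stimuli.
overall = mean(s["aggregateScore"] for s in report.values())
print(f"overallScore: {overall:.2f}")   # overallScore: 0.85

# Hypothetical cut-off: flag anything flaky or with pass^k below 0.8.
RELIABILITY_CUTOFF = 0.8
needs_work = [
    name for name, s in report.items()
    if s["flaky"] or s["pass_hat_k"] < RELIABILITY_CUTOFF
]
print(needs_work)   # ['edge-case-handling']
```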

Choosing runs and thresholds

How many runs?

Context	Recommended `runs`	Why
Inner loop (lint)	`0`	No execution — static checks only
CI gate	`3`	Enough to detect flakiness without slowing PRs
Outer loop / nightly	`5–10`	More samples for accurate pass@k with LLM judges

What threshold?

Goal	Recommended `threshold`	Metric to watch
“Does this work at all?”	`0.1`	pass@k — is the capability there?
“Is this reliable for CI?”	`0.7`	pass rate — does it usually work?
“Is this production-ready?”	`0.9`	pass^k — does it always work?

Next steps

How it works — where scoring fits in the pipeline
Writing eval specs — configure scoring in practice
Scoring functions reference — implementation details and formulas