Grader: pairwise (LLM judge)

The pairwise grader sends two trajectories to an LLM judge and asks it to compare them. It uses position-swap debiasing — running the comparison twice with A/B order reversed — to reduce order bias.

Taxonomy

Property	Value
Determinism	`llm`
Cost	`high`
Portability	`t3a-scenario`
Reference	`reference-based` (requires two trajectories)
Temporal scope	`cross-trajectory`
Score kind	`llm`

Config

graders:
  - type: pairwise
    config:
      prompt: "Which response better follows coding best practices?"
      model: gpt-5.5 # optional: override judge model
      threshold: 0.5 # optional: score needed to "pass"

Field	Type	Required	Default	Description
`prompt`	string	No	Default comparison rubric	Custom comparison prompt for the judge
`model`	string	No	`claude-sonnet-4.6` (or `--judge-model`)	Which model to use as the judge
`threshold`	number	No	`0.5`	Score threshold for passing (0.5 = tie or better)

How it works

1. Two trajectories are loaded: run A (typically the baseline) and run B (typically the new version).
2. The judge must call the `submit_pairwise_grade` tool exactly once, supplying per-criterion winners, magnitudes, and reasoning.
3. If the judge fails to call the tool or sends invalid arguments, an in-session reminder prompts it to retry (up to 2 times).
4. To reduce order bias, the comparison runs twice, with the A/B positions swapped on the second run.
5. The two results are merged with a consistency check: if the judge picks A in both runs (even after the swap), that is a strong signal.
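The swap-and-merge steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the grader's actual implementation: the `Verdict` type, the positional labels `"first"`/`"second"`/`"tie"`, and the choice to merge by averaging the two scores are all hypothetical, since this page does not specify how the two runs are combined.

```python
from dataclasses import dataclass

# Hypothetical shape of one judge result; not the real tool schema.
@dataclass
class Verdict:
    winner: str     # "first", "second", or "tie" (position shown to the judge)
    magnitude: str  # "much-better", "slightly-better", or "equal"

_SCORES = {"much-better": 1.0, "slightly-better": 0.75, "equal": 0.5}

def score_for_a(verdict: Verdict, a_shown_first: bool) -> float:
    """Convert a positional verdict into a score from A's point of view."""
    if verdict.winner == "tie" or verdict.magnitude == "equal":
        return 0.5
    score = _SCORES[verdict.magnitude]
    # A won if the winning position is the one A occupied in this run.
    a_won = (verdict.winner == "first") == a_shown_first
    # Assumption: a win for B mirrors the score around 0.5.
    return score if a_won else 1.0 - score

def merge(original: Verdict, swapped: Verdict) -> tuple[float, bool]:
    """Merge the A-first run and the B-first run (assumed: simple average)."""
    s1 = score_for_a(original, a_shown_first=True)
    s2 = score_for_a(swapped, a_shown_first=False)
    # Consistent when both runs land on the same side of 0.5 (or both tie).
    consistent = (s1 > 0.5) == (s2 > 0.5) and (s1 < 0.5) == (s2 < 0.5)
    return (s1 + s2) / 2, consistent
```

In this sketch, a judge that always prefers whichever response is shown first produces two scores that cancel out and a `consistent` flag of `False`, which is the kind of order bias the position swap is meant to expose.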

Score mapping

The judge returns a magnitude indicating how much better one response is:

Magnitude	Score	Meaning
`much-better`	1.0	A is clearly superior to B
`slightly-better`	0.75	A is somewhat better than B
`equal`	0.5	No meaningful difference

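A score > 0.5 means A (baseline) is better. A score < 0.5 means B (candidate) is better. Exactly 0.5 is a tie. The table lists magnitudes from A's side only; when B wins, the score falls below 0.5 (for example, B `much-better` is reported as 0.0).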
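As a rough illustration of the mapping, the sketch below converts a winner and magnitude into a score and applies the pass threshold. The function names are hypothetical, and the 0.25 value for a `slightly-better` win by B is an assumption made by mirroring the table around 0.5; only the A-side values and the 0.0 for B `much-better` appear on this page.

```python
# Sketch only: these names are illustrative, not part of the grader's API.
SCORE_WHEN_A_WINS = {
    "much-better": 1.0,
    "slightly-better": 0.75,
    "equal": 0.5,
}

def pairwise_score(winner: str, magnitude: str) -> float:
    """Map a verdict to a score in [0, 1] from A's point of view."""
    score = SCORE_WHEN_A_WINS[magnitude]
    if magnitude == "equal" or winner == "A":
        return score
    # Assumption: a win for B mirrors the A-side score around 0.5.
    return 1.0 - score

def passes(score: float, threshold: float = 0.5) -> bool:
    """A score at or above the threshold passes (0.5 = tie or better)."""
    return score >= threshold
```

With the default threshold, `passes(pairwise_score("A", "slightly-better"))` is true and `passes(pairwise_score("B", "much-better"))` is false, matching the two evidence lines at the end of this page.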

Usage

Pairwise graders are defined in `eval.yaml` alongside other graders, but they execute only during `compare`:

stimuli:
  - name: test-generation
    prompt: "Write tests for add(a, b)"
    graders:
      - type: file-exists
        config: { path: "*.test.js" }
      - type: pairwise # only runs in compare
        config:
          prompt: "Which tests are more comprehensive?"

Then run the comparison:

# Run baseline
vally eval --eval-spec eval.yaml --output-dir ./baseline

# Make changes, run again
vally eval --eval-spec eval.yaml --output-dir ./candidate

# Compare
vally compare \
  --eval-spec eval.yaml \
  --run-a ./baseline \
  --run-b ./candidate

Judge model resolution

The judge model is resolved through the same chain as the prompt grader, from highest to lowest precedence:

1. Grader-level `config.model` (most specific)
2. `--judge-model` CLI flag
3. `config.judge_model` in `eval.yaml`
4. Default: `claude-sonnet-4.6`
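The precedence chain above amounts to a first-non-empty lookup. The following is a minimal sketch of that idea; `resolve_judge_model` and its parameter names are hypothetical and only mirror the four sources listed here.

```python
from typing import Optional

DEFAULT_JUDGE_MODEL = "claude-sonnet-4.6"

def resolve_judge_model(
    grader_model: Optional[str] = None,      # grader-level config.model
    cli_judge_model: Optional[str] = None,   # --judge-model CLI flag
    eval_judge_model: Optional[str] = None,  # config.judge_model in eval.yaml
) -> str:
    """Return the first configured model, from most to least specific."""
    for candidate in (grader_model, cli_judge_model, eval_judge_model):
        if candidate:
            return candidate
    return DEFAULT_JUDGE_MODEL
```

For example, a grader that sets `model: gpt-5.5` keeps that judge even when `--judge-model` is passed, because the grader-level setting is checked first.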

Evidence examples

✔ pairwise  A is slightly-better (score: 0.75) — consistent across position swap
✘ pairwise  B is much-better (score: 0.0) — A regressed on code quality