
Grader: pairwise (LLM judge)

The pairwise grader sends two trajectories to an LLM judge and asks it to compare them. It uses position-swap debiasing — running the comparison twice with A/B order reversed — to reduce order bias.

| Property | Value |
| --- | --- |
| Determinism | llm |
| Cost | high |
| Portability | t3a-scenario |
| Reference | reference-based (requires two trajectories) |
| Temporal scope | cross-trajectory |
| Score kind | llm |

```yaml
graders:
  - type: pairwise
    config:
      prompt: "Which response better follows coding best practices?"
      model: gpt-5.5 # optional: override judge model
      threshold: 0.5 # optional: score needed to "pass"
```

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| prompt | string | No | Default comparison rubric | Custom comparison prompt for the judge |
| model | string | No | claude-sonnet-4.6 (or --judge-model) | Which model to use as the judge |
| threshold | number | No | 0.5 | Score threshold for passing (0.5 = tie or better) |
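
The defaults in the field table can be read as a small data model. The following is a minimal sketch only: `PairwiseConfig` and `parse_pairwise_config` are hypothetical names, and both the rejection of unknown keys and the 0.0 to 1.0 range check are assumptions that the table does not state.

```python
from dataclasses import dataclass
from typing import Any, Optional

DEFAULT_THRESHOLD = 0.5  # documented default for `threshold`


@dataclass(frozen=True)
class PairwiseConfig:
    """Hypothetical in-memory view of a pairwise grader's `config` block."""
    prompt: Optional[str] = None  # None: use the default comparison rubric
    model: Optional[str] = None   # None: fall back to the judge model chain
    threshold: float = DEFAULT_THRESHOLD


def parse_pairwise_config(raw: dict[str, Any]) -> PairwiseConfig:
    """Apply the documented defaults to a raw config mapping."""
    unknown = set(raw) - {"prompt", "model", "threshold"}
    if unknown:  # assumption: unknown keys are treated as errors
        raise ValueError(f"unknown pairwise config keys: {sorted(unknown)}")
    threshold = raw.get("threshold", DEFAULT_THRESHOLD)
    if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
        raise ValueError("threshold must be a number")
    if not 0.0 <= threshold <= 1.0:  # assumption: scores live in [0, 1]
        raise ValueError("threshold must be between 0.0 and 1.0")
    return PairwiseConfig(
        prompt=raw.get("prompt"),
        model=raw.get("model"),
        threshold=float(threshold),
    )
```

Because all three fields are optional, an empty `config` block yields the default rubric, the default judge model chain, and a threshold of 0.5.
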
  1. Two trajectories are loaded — run A (typically baseline) and run B (typically the new version)
  2. The judge must call the submit_pairwise_grade tool exactly once with per-criterion winners, magnitudes, and reasoning
  3. If the judge fails to call the tool or sends invalid arguments, an in-session reminder nudges it to retry (up to 2 times)
  4. To reduce order bias, the comparison runs twice with the A/B positions swapped
  5. Results are merged with consistency checking — if the judge picks A both times (even when swapped), that’s a strong signal
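
The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the grader's actual implementation: the `Judge` callable, `ask_with_reminders`, and `compare_debiased` are hypothetical names, the judge is modeled as returning a score from the perspective of whichever response it sees first (or `None` when it fails to submit a valid grade), the retry simply calls the judge again rather than sending an in-session reminder, and averaging the two passes is an assumed merge rule.

```python
from typing import Callable, Optional

# Hypothetical judge: takes (first_shown, second_shown) and returns a score
# for the first-shown response (1.0 = much better, 0.5 = tie, 0.0 = much
# worse), or None if it did not submit a valid grade.
Judge = Callable[[str, str], Optional[float]]

MAX_REMINDERS = 2  # "up to 2 times" in step 3


def ask_with_reminders(judge: Judge, first: str, second: str) -> float:
    """Ask once, then retry up to MAX_REMINDERS times on an invalid grade."""
    for _attempt in range(1 + MAX_REMINDERS):
        score = judge(first, second)
        if score is not None:
            return score
    raise RuntimeError("judge never submitted a valid grade")


def direction(score: float) -> int:
    """+1 if the score favors A, -1 if it favors B, 0 for a tie."""
    return (score > 0.5) - (score < 0.5)


def compare_debiased(judge: Judge, run_a: str, run_b: str) -> tuple[float, bool]:
    """Compare in both orders; return (A-perspective score, consistent?)."""
    forward = ask_with_reminders(judge, run_a, run_b)  # A shown first
    swapped = ask_with_reminders(judge, run_b, run_a)  # B shown first
    swapped_for_a = 1.0 - swapped                      # flip back to A's view
    merged = (forward + swapped_for_a) / 2             # assumed merge: mean
    consistent = direction(forward) == direction(swapped_for_a)
    return merged, consistent
```

With a judge that always prefers whichever response is shown first, the two passes cancel out to a tie and are flagged as inconsistent, which is the order bias the swap is meant to expose:

```python
position_biased = lambda first, second: 0.75
print(compare_debiased(position_biased, "run A", "run B"))  # (0.5, False)
```
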

The judge returns a magnitude indicating how much better one response is:

| Magnitude | Score | Meaning |
| --- | --- | --- |
| much-better | 1.0 | A is clearly superior to B |
| slightly-better | 0.75 | A is somewhat better than B |
| equal | 0.5 | No meaningful difference |

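Scores are expressed from A's perspective. A score > 0.5 means A (baseline) is better, a score < 0.5 means B (candidate) is better, and exactly 0.5 is a tie. The table lists the values for a win by A; a win by B falls below 0.5, down to 0.0 when B is much-better.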
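
As a rough sketch, the mapping from a verdict to a score and a pass/fail result might look like the following. The function names are hypothetical. The values for a win by A come from the table, and 0.0 for a much-better B matches the grader's reported output; the 0.25 value for a slightly-better B is an assumption based on symmetry around 0.5.

```python
# Distance from the 0.5 tie point for each magnitude.
MAGNITUDE_OFFSET = {"equal": 0.0, "slightly-better": 0.25, "much-better": 0.5}


def pairwise_score(winner: str, magnitude: str) -> float:
    """Return the score from A's perspective for a (winner, magnitude) verdict."""
    if magnitude not in MAGNITUDE_OFFSET:
        raise ValueError(f"unknown magnitude: {magnitude!r}")
    if magnitude == "equal":
        return 0.5  # a tie has no winner
    if winner == "A":
        return 0.5 + MAGNITUDE_OFFSET[magnitude]
    if winner == "B":
        # Assumption: B's wins mirror A's wins around 0.5.
        return 0.5 - MAGNITUDE_OFFSET[magnitude]
    raise ValueError(f"winner must be 'A' or 'B', got {winner!r}")


def passes(score: float, threshold: float = 0.5) -> bool:
    """A score at or above the threshold passes (0.5 = tie or better)."""
    return score >= threshold
```

Under these assumptions, `pairwise_score("A", "slightly-better")` gives 0.75 and passes at the default threshold, while `pairwise_score("B", "much-better")` gives 0.0 and fails.
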

Pairwise graders are defined in eval.yaml alongside other graders but only execute during compare:

```yaml
# eval.yaml
stimuli:
  - name: test-generation
    prompt: "Write tests for add(a, b)"
    graders:
      - type: file-exists
        config: { path: "*.test.js" }
      - type: pairwise # only runs in compare
        config:
          prompt: "Which tests are more comprehensive?"
```

Then run the comparison:

```bash
# Run baseline
vally eval --eval-spec eval.yaml --output-dir ./baseline
# Make changes, run again
vally eval --eval-spec eval.yaml --output-dir ./candidate
# Compare
vally compare \
  --eval-spec eval.yaml \
  --run-a ./baseline \
  --run-b ./candidate
```

The judge model is resolved through the same chain as for the prompt grader; the first value that is set wins:

  1. Grader-level config.model (most specific)
  2. --judge-model CLI flag
  3. config.judge_model in eval.yaml
  4. Default: claude-sonnet-4.6
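
The resolution order above amounts to a first-match lookup. A minimal sketch, assuming the three sources have already been read into plain optional strings (the function and parameter names are hypothetical):

```python
from typing import Optional

DEFAULT_JUDGE_MODEL = "claude-sonnet-4.6"  # documented default


def resolve_judge_model(
    grader_model: Optional[str] = None,      # grader-level config.model
    cli_judge_model: Optional[str] = None,   # --judge-model CLI flag
    spec_judge_model: Optional[str] = None,  # config.judge_model in eval.yaml
) -> str:
    """Return the first model that is set, from most to least specific."""
    for candidate in (grader_model, cli_judge_model, spec_judge_model):
        if candidate:
            return candidate
    return DEFAULT_JUDGE_MODEL
```

For example, `resolve_judge_model(cli_judge_model="gpt-5.5")` returns `"gpt-5.5"`, while calling it with no arguments returns the default.
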

Example output from the pairwise grader:

✔ pairwise A is slightly-better (score: 0.75) — consistent across position swap
✘ pairwise B is much-better (score: 0.0) — A regressed on code quality