Grader: pairwise (LLM judge)
The pairwise grader sends two trajectories to an LLM judge and asks it to compare them. It uses position-swap debiasing — running the comparison twice with A/B order reversed — to reduce order bias.
Taxonomy
| Property | Value |
|---|---|
| Determinism | llm |
| Cost | high |
| Portability | t3a-scenario |
| Reference | reference-based (requires two trajectories) |
| Temporal scope | cross-trajectory |
| Score kind | llm |
Config
```yaml
graders:
  - type: pairwise
    config:
      prompt: "Which response better follows coding best practices?"
      model: gpt-5.5    # optional: override judge model
      threshold: 0.5    # optional: score needed to "pass"
```

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `prompt` | string | No | Default comparison rubric | Custom comparison prompt for the judge |
| `model` | string | No | `claude-sonnet-4.6` (or `--judge-model`) | Which model to use as the judge |
| `threshold` | number | No | 0.5 | Score threshold for passing (0.5 = tie or better) |
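
The defaults in the table can be sketched as a small parser. This is only an illustration: `PairwiseConfig` and `parse_pairwise_config` are hypothetical names, and both the rejection of unknown fields and the 0.0 to 1.0 range check on `threshold` are assumptions, not documented behavior.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

DEFAULT_THRESHOLD = 0.5  # documented default for `threshold`


@dataclass(frozen=True)
class PairwiseConfig:
    """Hypothetical in-memory form of a pairwise grader's `config` block."""
    prompt: Optional[str] = None  # None: the judge uses the default comparison rubric
    model: Optional[str] = None   # None: fall through to the judge model resolution chain
    threshold: float = DEFAULT_THRESHOLD


def parse_pairwise_config(raw: Mapping[str, Any]) -> PairwiseConfig:
    """Apply the documented defaults to a raw config mapping."""
    # Assumption: unknown fields are treated as errors.
    unknown = set(raw) - {"prompt", "model", "threshold"}
    if unknown:
        raise ValueError(f"unknown pairwise config fields: {sorted(unknown)}")
    threshold = float(raw.get("threshold", DEFAULT_THRESHOLD))
    # Assumption: thresholds live on the same 0.0 to 1.0 scale as scores.
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return PairwiseConfig(
        prompt=raw.get("prompt"),
        model=raw.get("model"),
        threshold=threshold,
    )
```

All three fields are optional, so an empty mapping yields a config that relies entirely on defaults.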
How it works
- Two trajectories are loaded: run A (typically the baseline) and run B (typically the new version).
- The judge must call the `submit_pairwise_grade` tool exactly once with per-criterion winners, magnitudes, and reasoning.
- If the judge fails to call the tool or sends invalid arguments, an in-session reminder nudges it to retry (up to 2 times).
- To reduce order bias, the comparison runs twice with the A/B positions swapped.
- Results are merged with consistency checking: if the judge picks A both times (even when swapped), that's a strong signal.
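
The position-swap step can be sketched as follows. This is a minimal illustration, not the grader's actual merge logic: the `Judge` callable, `compare_with_swap`, averaging the two passes, and the definition of "consistent" as both passes favoring the same side are all assumptions made for the example.

```python
from typing import Callable, NamedTuple

# Hypothetical judge interface: takes (first_response, second_response) and
# returns a score in [0, 1] from the point of view of the response shown FIRST.
Judge = Callable[[str, str], float]


class PairwiseResult(NamedTuple):
    score: float      # merged score from A's point of view
    consistent: bool  # True when both passes favor the same side


def direction(score: float) -> int:
    """+1 if A is favored, -1 if B is favored, 0 for a tie."""
    return (score > 0.5) - (score < 0.5)


def compare_with_swap(judge: Judge, a: str, b: str) -> PairwiseResult:
    forward = judge(a, b)         # pass 1: A shown first
    backward = 1.0 - judge(b, a)  # pass 2: B shown first, flipped back to A's view
    merged = (forward + backward) / 2  # assumption: simple average of both passes
    return PairwiseResult(merged, direction(forward) == direction(backward))


def length_judge(first: str, second: str) -> float:
    """Stub judge with no position bias: prefers the longer response."""
    if len(first) == len(second):
        return 0.5
    return 0.75 if len(first) > len(second) else 0.25


def first_position_judge(first: str, second: str) -> float:
    """Stub judge with pure position bias: always prefers whatever is shown first."""
    return 0.75
```

With these stubs, `length_judge` gives the same verdict in both orders, so the result is marked consistent. `first_position_judge` contradicts itself after the swap, so the two passes cancel out to a tie and the result is flagged as inconsistent.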
Score mapping
The judge returns a magnitude indicating how much better one response is:
| Magnitude | Score | Meaning |
|---|---|---|
| `much-better` | 1.0 | A is clearly superior to B |
| `slightly-better` | 0.75 | A is somewhat better than B |
| `equal` | 0.5 | No meaningful difference |
A score > 0.5 means A (baseline) is better. A score < 0.5 means B (candidate) is better. Exactly 0.5 is a tie.
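
The mapping can be sketched in a few lines. The table only lists the cases where A wins, so mirroring the scale around 0.5 for B is an assumption: 0.0 for "B is much-better" matches the evidence example on this page, while 0.25 for "B is slightly-better" is inferred by symmetry. `magnitude_to_score` and `passes` are hypothetical helper names.

```python
# Scores when A is the winner, as listed in the table above.
A_FAVORED = {"much-better": 1.0, "slightly-better": 0.75, "equal": 0.5}


def magnitude_to_score(winner: str, magnitude: str) -> float:
    """Map a (winner, magnitude) verdict to a score from A's point of view."""
    if magnitude not in A_FAVORED:
        raise ValueError(f"unknown magnitude: {magnitude!r}")
    if magnitude == "equal":
        return 0.5  # a tie has no winner
    if winner == "A":
        return A_FAVORED[magnitude]
    if winner == "B":
        # Assumption: B's wins mirror A's around 0.5 (0.25 and 0.0).
        return 1.0 - A_FAVORED[magnitude]
    raise ValueError(f"unknown winner: {winner!r}")


def passes(score: float, threshold: float = 0.5) -> bool:
    """A score at or above the threshold passes (0.5 = tie or better)."""
    return score >= threshold
```

Under the default threshold of 0.5, a tie passes and any verdict in favor of B fails.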
Pairwise graders are defined in eval.yaml alongside other graders but only execute during compare:
```yaml
stimuli:
  - name: test-generation
    prompt: "Write tests for add(a, b)"
    graders:
      - type: file-exists
        config: { path: "*.test.js" }
      - type: pairwise    # only runs in compare
        config:
          prompt: "Which tests are more comprehensive?"
```

Then run the comparison:

```bash
# Run baseline
vally eval --eval-spec eval.yaml --output-dir ./baseline

# Make changes, run again
vally eval --eval-spec eval.yaml --output-dir ./candidate

# Compare
vally compare \
  --eval-spec eval.yaml \
  --run-a ./baseline \
  --run-b ./candidate
```

Judge model resolution
Same chain as the prompt grader:
- Grader-level `config.model` (most specific)
- `--judge-model` CLI flag
- `config.judge_model` in eval.yaml
- Default: `claude-sonnet-4.6`
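
The resolution chain above amounts to "first value that is set wins". A minimal sketch, assuming each source is passed in as an optional string; `resolve_judge_model` and its parameter names are hypothetical and only mirror the four steps listed.

```python
from typing import Optional

DEFAULT_JUDGE_MODEL = "claude-sonnet-4.6"  # documented default


def resolve_judge_model(
    grader_model: Optional[str] = None,      # grader-level config.model
    cli_judge_model: Optional[str] = None,   # --judge-model CLI flag
    spec_judge_model: Optional[str] = None,  # config.judge_model in eval.yaml
) -> str:
    """Return the first configured model, from most to least specific."""
    for candidate in (grader_model, cli_judge_model, spec_judge_model):
        if candidate:  # skips both None and empty strings
            return candidate
    return DEFAULT_JUDGE_MODEL
```

For example, a grader that sets `model: gpt-5.5` keeps that judge even when `--judge-model` is passed on the command line, because the grader-level value is checked first.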
Evidence examples
```
✔ pairwise A is slightly-better (score: 0.75) — consistent across position swap
✘ pairwise B is much-better (score: 0.0) — A regressed on code quality
```