CLI: compare

Usage

vally compare <experiment-dir> [options]
vally compare --baseline <path> --treatment <path> [options]

Description

Compares a treatment run against a baseline and reports a signed, treatment-relative verdict together with metric deltas and statistics. The prompt judge compares each stimulus against its rubric. Because the rubric is read from the trajectories, no eval spec is required.

There are two input modes:

Experiment mode — vally compare <experiment-dir>. Reads each variant’s results.jsonl from an experiment output directory, discovers the baseline variant from the embedded provenance, and fans out into one baseline-vs-treatment comparison per other variant.
Two-run mode — vally compare --baseline <path> --treatment <path>. Compares two independent runs of the same eval spec (for example, a CI regression check across time). Trajectories are paired by stimulus name and trial index.
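
The pairing rule in two-run mode can be sketched as a lookup keyed on (stimulus name, trial index). This is a minimal illustration, not the tool's implementation: the `stimulus` and `trial` field names are hypothetical, and silently skipping records that have no counterpart is an assumption.

```python
from typing import Any, Dict, Iterable, List, Tuple

Record = Dict[str, Any]


def pair_trajectories(
    baseline: Iterable[Record], treatment: Iterable[Record]
) -> List[Tuple[Record, Record]]:
    """Pair records that share the same (stimulus name, trial index) key."""
    # "stimulus" and "trial" are hypothetical field names used for illustration.
    by_key = {(str(r["stimulus"]), int(r["trial"])): r for r in baseline}
    pairs = []
    for t in treatment:
        key = (str(t["stimulus"]), int(t["trial"]))
        if key in by_key:  # assumption: records without a counterpart are skipped
            pairs.append((by_key[key], t))
    return pairs


baseline = [
    {"stimulus": "login", "trial": 0, "run": "main"},
    {"stimulus": "login", "trial": 1, "run": "main"},
    {"stimulus": "search", "trial": 0, "run": "main"},
]
treatment = [
    {"stimulus": "login", "trial": 1, "run": "pr"},
    {"stimulus": "login", "trial": 0, "run": "pr"},
    {"stimulus": "checkout", "trial": 0, "run": "pr"},
]
pairs = pair_trajectories(baseline, treatment)
```

Here only the two `login` trials pair up; `search` and `checkout` appear in one run only and are left out.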

Options

Flag	Type	Required	Description
`<experiment-dir>`	string	mode	Experiment output directory (experiment mode). Mutually exclusive with the two-run flags.
`--baseline <path>`	string	mode	Baseline run directory or JSONL file (two-run mode).
`--treatment <path>`	string	mode	Treatment run directory or JSONL file (two-run mode).
`--baseline-variant <name>`	string	No	Experiment mode: override the baseline variant recorded in the output.
`-e, --eval-spec <path>`	string	No	Optional eval spec to override the rubric embedded in the trajectories.
`--judge-model <model>`	string	No	Model for the comparison judge.
`--judge-reasoning-effort <level>`	string	No	Reasoning effort for the comparison judge model: `low`, `medium`, `high`, or `xhigh`.
`--output <path>`	string	No	Write per-treatment comparison records to this JSONL file.
`--fail-on-regression`	boolean	No	Exit non-zero if a treatment scores significantly worse than the baseline (a negative mean score whose 95% CI lies entirely below zero) or if every comparison fails.
`--param <key=value>`	string	No	Set a param value (repeatable, e.g. `--param MODEL=gpt-4o`). Overrides eval param files and `.vally.yaml`.
`--verbose`	boolean	No	Show per-stimulus and per-trial detail.

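Provide either an experiment directory or the --baseline and --treatment pair, but never an experiment directory together with the two-run flags. In the Required column, mode marks an argument that is required only in its own input mode.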

Output

For each treatment, compare prints:

Win rate — wins / ties / losses across the paired trials.
Mean score — the signed treatment-relative score in [-1, 1] (+ means the treatment is better) with a 95% confidence interval.
McNemar test — a paired test on pass/fail flips between baseline and treatment, when the trajectories carry grade results.
Metric deltas — per-metric treatment − baseline differences (tokens, turns, tool calls, wall time, errors) with confidence intervals and a high-variance flag.
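
For readers unfamiliar with the McNemar test, the sketch below shows one common form: an exact two-sided binomial test on the discordant pairs (trials whose pass/fail grade flipped between baseline and treatment). Whether the tool uses this exact form or a chi-square approximation is not stated here, so treat the function as an illustration only; `mcnemar_exact` is a hypothetical name.

```python
from math import comb


def mcnemar_exact(fail_to_pass: int, pass_to_fail: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    fail_to_pass: pairs where the baseline failed and the treatment passed.
    pass_to_fail: pairs where the baseline passed and the treatment failed.
    Pairs with the same grade on both sides carry no information and are ignored.
    """
    n = fail_to_pass + pass_to_fail
    if n == 0:
        return 1.0  # no flips at all, so no evidence of a difference
    k = min(fail_to_pass, pass_to_fail)
    # Under the null hypothesis each flip direction has probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = mcnemar_exact(fail_to_pass=6, pass_to_fail=1)
```

A small p-value suggests the flips lean in one direction more than chance would explain.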

Failed judge calls are excluded from the statistics.
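
As a rough illustration of how the win rate and mean score relate, the sketch below summarizes a list of per-trial scores, with `None` standing in for a failed judge call. The normal-approximation interval (1.96 standard errors) is an assumption; the interval method the tool actually uses is not specified here, and `summarize` is a hypothetical helper.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Dict, List, Optional


def summarize(scores: List[Optional[float]]) -> Dict[str, float]:
    """Win/tie/loss counts and mean score with an approximate 95% CI."""
    # Failed judge calls (None) are dropped before any statistic is computed.
    valid = [s for s in scores if s is not None]
    if not valid:
        raise ValueError("no successful judge calls to summarize")
    m = mean(valid)
    # Assumption: normal approximation; a single sample gets a zero-width interval.
    half = 1.96 * stdev(valid) / sqrt(len(valid)) if len(valid) > 1 else 0.0
    return {
        "wins": sum(s > 0 for s in valid),
        "ties": sum(s == 0 for s in valid),
        "losses": sum(s < 0 for s in valid),
        "mean": m,
        "ci_low": m - half,
        "ci_high": m + half,
    }


summary = summarize([1.0, 0.4, 0.0, -0.4, None, 0.4])
```

In this example the failed call is ignored, leaving three wins, one tie, and one loss across five scored trials.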

Scoring scale

The comparison verdict uses a signed five-bucket magnitude, from the treatment’s perspective:

Verdict	Score
much better	`+1.0`
slightly better	`+0.4`
equal / tie	`0`
slightly worse	`-0.4`
much worse	`-1.0`

The judge runs each comparison twice with the two responses swapped (position-swap debiasing); if the two directions disagree on the winner, the result is forced to a tie.
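
The position-swap rule can be sketched as follows. Two points are assumptions rather than documented behavior: both verdicts are taken as already expressed from the treatment's perspective, and when the two directions agree on the winner the sketch averages their scores. A tie in one direction and a winner in the other is also treated as disagreement here.

```python
# Verdict-to-score mapping from the table above ("equal" stands for equal / tie).
SCORES = {
    "much better": 1.0,
    "slightly better": 0.4,
    "equal": 0.0,
    "slightly worse": -0.4,
    "much worse": -1.0,
}


def sign(x: float) -> int:
    """Return -1, 0, or +1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def debiased_score(first_pass: str, swapped_pass: str) -> float:
    """Combine the verdicts from the original and position-swapped judge calls."""
    a, b = SCORES[first_pass], SCORES[swapped_pass]
    if sign(a) != sign(b):
        return 0.0  # the directions disagree on the winner: force a tie
    return (a + b) / 2  # assumption: agreeing directions are averaged


disagreement = debiased_score("much better", "slightly worse")
agreement = debiased_score("much better", "slightly better")
```

Forcing disagreements to a tie means a verdict only counts when it survives swapping the order in which the judge sees the two responses.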

Exit codes

Code	Meaning
`0`	Comparison completed (with no regression detected, when `--fail-on-regression` is set)
`1`	Error occurred, or a regression was detected with `--fail-on-regression`
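
The regression rule behind these exit codes can be sketched as below. The data shape is hypothetical: each treatment is summarized as a (mean score, upper 95% CI bound) tuple, with `None` marking a treatment whose comparisons all failed. Errors unrelated to regression, which also exit with `1`, are not modeled.

```python
from typing import List, Optional, Tuple

# (mean score, upper bound of the 95% CI); None marks a treatment whose
# comparisons all failed. This shape is an assumption made for illustration.
Summary = Optional[Tuple[float, float]]


def is_regression(summary: Summary) -> bool:
    """True when a treatment is significantly worse or could not be scored."""
    if summary is None:
        return True
    mean_score, ci_high = summary
    # The whole interval must sit below zero, not just the mean.
    return mean_score < 0 and ci_high < 0


def exit_code(summaries: List[Summary], fail_on_regression: bool) -> int:
    """Exit code for a completed comparison, following the table above."""
    if fail_on_regression and any(is_regression(s) for s in summaries):
        return 1
    return 0


significant = exit_code([(-0.3, -0.1)], fail_on_regression=True)
inconclusive = exit_code([(-0.3, 0.05)], fail_on_regression=True)
```

A negative mean whose interval still reaches zero or above is not treated as a regression, so a noisy run does not fail the check on its own.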

Examples

# Compare every variant of an experiment against its baseline
vally compare ./experiments/output/2026-06-26T12-00-00

# Ad-hoc comparison of two independent runs (CI regression check)
vally compare \
  --baseline ./results/main \
  --treatment ./results/pr-branch \
  --fail-on-regression

# Override the rubric with an eval spec and pick a judge model
vally compare ./experiments/output/latest \
  --eval-spec eval.yaml \
  --judge-model gpt-5.5 \
  --verbose

# Write machine-readable comparison records
vally compare \
  --baseline ./results/v1.jsonl \
  --treatment ./results/v2.jsonl \
  --output ./comparison.jsonl

Workflow

A typical A/B comparison:

Run eval on the baseline: vally eval --eval-spec eval.yaml --output-dir ./baseline
Make your change (update a skill, change a model, etc.).
Run eval on the treatment: vally eval --eval-spec eval.yaml --output-dir ./treatment
Compare: vally compare --baseline ./baseline --treatment ./treatment
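
In CI, the final compare step is often wrapped so that the build fails on a regression. The sketch below assumes `vally` is on the PATH and that both run directories already exist; `compare_command` and `gate` are hypothetical helper names, and only flags documented on this page are used.

```python
import subprocess
from typing import Callable, List


def compare_command(baseline_dir: str, treatment_dir: str) -> List[str]:
    """Build the two-run compare invocation with regression gating enabled."""
    return [
        "vally", "compare",
        "--baseline", baseline_dir,
        "--treatment", treatment_dir,
        "--fail-on-regression",
    ]


def gate(
    baseline_dir: str,
    treatment_dir: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Return True when compare exits with 0 (completed, no regression)."""
    result = run(compare_command(baseline_dir, treatment_dir))
    return result.returncode == 0


# Example call (not executed here):
#   ok = gate("./baseline", "./treatment")
```

The `run` parameter exists only so the wrapper can be exercised without invoking the real CLI.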

For multi-variant experiments, run vally experiment run and point compare at the experiment output directory instead — it compares every variant against the declared baseline in one pass.