
CLI: compare

vally compare [options]

Compare two sets of eval trajectories using pairwise LLM judge graders. Loads trajectories from both runs, matches them by stimulus name and trial index, and runs only type: pairwise graders from the eval spec.

| Flag | Type | Required | Description |
| --- | --- | --- | --- |
| --eval-spec <path> | string | Yes | Path to eval spec file (only pairwise graders are used) |
| --run-a <path> | string | Yes | Path to baseline run (directory or JSONL file) |
| --run-b <path> | string | Yes | Path to candidate run (directory or JSONL file) |
| --judge-model <model> | string | No | Model for LLM judge graders |
| --grader-plugin <specifier> | string | No | Grader plugin to load (npm package name or local path) |
| --verbose | boolean | No | Show detailed grader evidence |

| Code | Meaning |
| --- | --- |
| 0 | Comparison completed |
| 1 | Error occurred |
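
In a script or CI job, the two exit codes can be used to separate a completed comparison from a failed one. The following is a minimal sketch, not part of vally itself: the function name `compare_or_report` and the `VALLY` override variable are hypothetical, and only the `vally compare` command and its exit codes come from this page. Note that exit code 0 means the comparison completed; this page does not say that it reflects which run the judge preferred.

```shell
# Hypothetical wrapper; VALLY can be overridden to point at a different binary.
VALLY="${VALLY:-vally}"

compare_or_report() {
  # Pass all arguments straight through to `vally compare`.
  if "$VALLY" compare "$@"; then
    # Exit code 0: the comparison completed.
    echo "comparison completed"
  else
    # Non-zero exit code: an error occurred. Report it and propagate it.
    status=$?
    echo "vally compare failed with exit code $status" >&2
    return "$status"
  fi
}
```

It could be called with the same flags as the command itself, for example `compare_or_report --eval-spec eval.yaml --run-a ./results/baseline --run-b ./results/candidate`.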
# Compare two eval runs
vally compare \
  --eval-spec eval.yaml \
  --run-a ./results/baseline \
  --run-b ./results/candidate

# With a specific judge model
vally compare \
  --eval-spec eval.yaml \
  --run-a ./results/baseline \
  --run-b ./results/candidate \
  --judge-model gpt-5.5

# Verbose output
vally compare \
  --eval-spec eval.yaml \
  --run-a ./results/v1.jsonl \
  --run-b ./results/v2.jsonl \
  --verbose

The typical A/B comparison workflow:

  1. Run eval on the baseline: vally eval --eval-spec eval.yaml --output-dir ./baseline
  2. Make your changes (update skill, change model, etc.)
  3. Run eval on the candidate: vally eval --eval-spec eval.yaml --output-dir ./candidate
  4. Compare: vally compare --eval-spec eval.yaml --run-a ./baseline --run-b ./candidate
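
The steps above can be wrapped in two small shell helpers, sketched below. The function names `run_eval` and `compare_runs` and the `VALLY` override variable are hypothetical conveniences; the `vally eval` and `vally compare` invocations and their flags are the ones shown in the list. Step 2 (making your changes) stays manual, so the helpers are meant to be called before and after it.

```shell
# Hypothetical helpers; VALLY can be overridden to point at a different binary.
VALLY="${VALLY:-vally}"

# Steps 1 and 3: run the eval and write results to the given directory.
# Usage: run_eval <eval-spec> <output-dir>
run_eval() {
  "$VALLY" eval --eval-spec "$1" --output-dir "$2"
}

# Step 4: compare the baseline (run A) against the candidate (run B).
# Usage: compare_runs <eval-spec> <baseline-path> <candidate-path>
compare_runs() {
  "$VALLY" compare --eval-spec "$1" --run-a "$2" --run-b "$3"
}
```

A session would then look like `run_eval eval.yaml ./baseline`, followed by your changes, then `run_eval eval.yaml ./candidate` and `compare_runs eval.yaml ./baseline ./candidate`.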

The compare command uses only the type: pairwise graders from the eval spec; all other grader types are ignored. You can therefore define pairwise graders alongside your regular graders in the same eval.yaml.