CLI: compare

    vally compare [options]

Description
Compare two sets of eval trajectories using pairwise LLM judge graders. Loads trajectories from both runs, matches them by stimulus name and trial index, and runs only `type: pairwise` graders from the eval spec.

Options

| Flag | Type | Required | Description |
|---|---|---|---|
| `--eval-spec <path>` | string | Yes | Path to eval spec file (only pairwise graders are used) |
| `--run-a <path>` | string | Yes | Path to baseline run (directory or JSONL file) |
| `--run-b <path>` | string | Yes | Path to candidate run (directory or JSONL file) |
| `--judge-model <model>` | string | No | Model for LLM judge graders |
| `--grader-plugin <specifier>` | string | No | Grader plugin to load (npm package name or local path) |
| `--verbose` | boolean | No | Show detailed grader evidence |

Exit codes

| Code | Meaning |
|---|---|
| `0` | Comparison completed |
| `1` | Error occurred |

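The two exit codes above are enough to branch on in a script or CI job. The following is a minimal sketch, not part of the tool itself: the function name `compare_runs` and its messages are hypothetical, and only the `vally compare` flags and the `0`/`1` codes come from this page.

```shell
#!/usr/bin/env bash
# Sketch: wrap `vally compare` and branch on its documented exit codes.
# `compare_runs` is a hypothetical helper name, not something vally provides.

compare_runs() {
  local spec="$1"
  local run_a="$2"
  local run_b="$3"
  local status=0

  # Capture the exit code without aborting the calling script.
  vally compare --eval-spec "$spec" --run-a "$run_a" --run-b "$run_b" || status=$?

  if [ "$status" -eq 0 ]; then
    echo "comparison completed"
  else
    echo "comparison failed (exit code $status)" >&2
  fi

  # Propagate the original status to the caller.
  return "$status"
}
```

A caller would invoke it as `compare_runs eval.yaml ./results/baseline ./results/candidate`. The table documents `0` only as "Comparison completed", so a zero status presumably says nothing about which run the judge preferred; a quality gate would likely need to read the command's output instead.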
Examples

    # Compare two eval runs
    vally compare \
      --eval-spec eval.yaml \
      --run-a ./results/baseline \
      --run-b ./results/candidate

    # With a specific judge model
    vally compare \
      --eval-spec eval.yaml \
      --run-a ./results/baseline \
      --run-b ./results/candidate \
      --judge-model gpt-5.5

    # Verbose output
    vally compare \
      --eval-spec eval.yaml \
      --run-a ./results/v1.jsonl \
      --run-b ./results/v2.jsonl \
      --verbose

Workflow
The typical A/B comparison workflow:
- Run eval on the baseline:

      vally eval --eval-spec eval.yaml --output-dir ./baseline

- Make your changes (update skill, change model, etc.)
- Run eval on the candidate:

      vally eval --eval-spec eval.yaml --output-dir ./candidate

- Compare:

      vally compare --eval-spec eval.yaml --run-a ./baseline --run-b ./candidate

The `compare` command extracts only `type: pairwise` graders from the eval spec; all other grader types are ignored. Define pairwise graders alongside your regular graders in the same `eval.yaml`.
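
The four workflow steps can be chained in a single script. This is a rough sketch under stated assumptions: `ab_workflow` and the `apply_changes` hook are hypothetical names introduced here for illustration, and the script assumes `vally eval` returns a non-zero status on failure, which this page does not document.

```shell
#!/usr/bin/env bash
# Sketch of the A/B workflow: baseline eval, apply changes, candidate eval, compare.
# `ab_workflow` and the change hook are hypothetical; only the vally commands
# and flags are taken from the workflow steps above.

ab_workflow() {
  local spec="$1"
  local baseline_dir="$2"
  local candidate_dir="$3"
  local apply_changes="$4"   # command that switches to the candidate setup

  # Step 1: run the eval on the baseline.
  vally eval --eval-spec "$spec" --output-dir "$baseline_dir" || return 1

  # Step 2: make your changes (update skill, change model, etc.).
  "$apply_changes" || return 1

  # Step 3: run the eval on the candidate.
  vally eval --eval-spec "$spec" --output-dir "$candidate_dir" || return 1

  # Step 4: compare; the function returns the exit code of `vally compare`.
  vally compare --eval-spec "$spec" --run-a "$baseline_dir" --run-b "$candidate_dir"
}
```

It could be called as `ab_workflow eval.yaml ./baseline ./candidate my_change_script`, where `my_change_script` stands in for whatever command applies your changes. Stopping at the first failed step avoids comparing against an incomplete run.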