CLI: eval
vally eval [options]Description
Section titled “Description”Run stimuli against an agent and grade the results. Stimuli come from one of three sources:
--eval-spec/-e— load stimuli from one or more eval spec files.--suite— run a named suite of evals defined in.vally.yaml.--tag— discover and filter eval stimuli by tag.
Options
Section titled “Options”| Flag | Type | Default | Description |
|---|---|---|---|
--eval-spec, -e <path> | string | — | Path to eval spec file (repeatable) |
--skill-dir <path> | string | — | Opt-in: discover and load skills from this directory (same detection as vally lint <path>) for the whole run. A missing/non-directory path fails the command. Omit it to load only skills specified in environment.skills. |
--work-dir <path> | string | cwd | Working directory for agent runs |
--model <model> | string | From eval spec | Model to use for agent execution. Comma-separated list runs evals against each model (e.g., --model gpt-5.5,claude-opus-4.8). |
--timeout <duration> | duration | 2m | Timeout per run (e.g. 5m, 300s, 30000ms). Unit suffix required. Falls back to config.timeout from the eval spec, then to 2m. |
--threshold <number> | number | From eval spec | Override the eval scoring threshold for this command invocation. Must be between 0 and 1. When no threshold is configured, evals use the binary grader pass/fail verdict. |
--judge-model <model> | string | From eval spec | Model for LLM judge graders (overrides config.judge_model in the eval spec) |
--grader-plugin <specifier> | string | — | Grader plugin to load (npm package name or local path). Registers external graders before validation and execution. |
--executor-plugin <specifier> | string | — | Executor plugin to load (npm package name or local path). Repeatable — pass multiple flags when running eval specs that select different executors via their config.executor field. |
--eval-plugin <specifier> | string | — | Eval provider plugin to load (npm package name or local path). Requires --eval-spec; cannot be combined with --suite. |
--output-dir <path> | string | ./vally-results | Parent directory for eval outputs. Each run is written to a timestamped subdirectory such as <output-dir>/<timestamp>/, containing files like trajectories and eval outcomes. |
--output <format> | jsonl | — | Output format. jsonl emits one JSON object per stimulus to stdout. |
--junit | boolean | false | Write a JUnit XML report (eval-results.junit.xml) to the run’s output directory. |
--workspace <path> | string | — | Workspace directory for the agent (preserved after run). |
--runs <n> | number | From eval spec | Number of trials per stimulus (overrides config.runs). ≥ 2 enables multi-trial mode with pass@k metrics. |
--max-retries <n> | number | 2 | Max retries per stimulus on retryable errors (timeouts, rate limits, and transient connection failures). Each retry uses a fresh workspace with exponential backoff. Set to 0 to disable. |
--skip-grade | boolean | false | Run agents without grading (grade later with the grade command) |
--skip-validate | boolean | false | Skip eval spec validation (validation runs automatically by default) |
--workers <n> | number | 5 | Number of concurrent stimulus sessions |
--shutdown-timeout <duration> | duration | 5s | Max time to wait for executors to shut down during cleanup, including on Ctrl+C (e.g. 5s, 500ms). |
--suite <name> | string | — | Run a named suite of evals from .vally.yaml. Cannot be combined with --eval-spec. |
--tag <key=values> | string | — | Filter stimuli by tag (repeatable, e.g., --tag cost=free,low). |
--verbose | boolean | false | Increase logging verbosity and show full agent output |
Exit codes
Section titled “Exit codes”| Code | Meaning |
|---|---|
0 | All eval verdicts passed, or verdict failures were suppressed by an override such as --threshold 0 |
1 | One or more eval verdicts failed, or an execution/tooling error occurred |
When a threshold is configured via --threshold or scoring.threshold, the eval verdict is based on the aggregate score meeting that threshold. If no threshold is configured, the command uses the binary grader pass/fail verdict. Execution and tooling errors always exit with code 1, even when --threshold 0 is used.
Skills
Section titled “Skills”A run is skill-free by default. Skills load only when you opt in through one of two explicit mechanisms:
environment.skills(recommended) — list skill directories in an eval spec’senvironmentblock. A top-level block applies to all stimuli in the spec; an individual stimulus can declare its own. Paths resolve relative to the eval file. Each listed directory is copied into the agent’s isolated workspace, so the agent sees exactly the declared skills and nothing else. A missing directory, a directory without aSKILL.md, or an unparseableSKILL.mdfails the run fast rather than silently loading nothing.--skill-dir <path>(whole run, opt-in) — discover and load skills from<path>(the same detectionvally lint <path>uses) for all stimuli in the run. Useful for ad-hoc “load this folder of skills” runs. A missing or non-directory path fails the command.
The two mechanisms don’t stack: when environment.skills is declared (in an eval spec, applying to all its stimuli, or on an individual stimulus), that list is the complete skill set and the --skill-dir base is ignored (replace, not merge). Stimuli with no environment.skills fall back to the --skill-dir set.
Examples
Section titled “Examples”With eval spec
Section titled “With eval spec”# Run all stimuli from eval.yamlvally eval \ --eval-spec eval.yaml \ --skill-dir ./skills/test-writer \ --output-dir ./results \ --model gpt-5.5
# Run multiple eval spec filesvally eval -e auth.eval.yaml -e perf.eval.yaml \ --skill-dir ./skills/test-writer
# Compare models side-by-sidevally eval \ --eval-spec eval.yaml \ --model gpt-5.5,claude-opus-4.8 \ --output-dir ./results
# Keep workflow steps running for poor eval performance, but still fail on execution errorsvally eval \ --eval-spec eval.yaml \ --threshold 0Capture without grading
Section titled “Capture without grading”# Capture trajectories now, grade latervally eval \ --eval-spec eval.yaml \ --skip-grade \ --output jsonl > outcomes.jsonl
# Later: grade with updated gradersvally grade --eval-spec eval.yaml < outcomes.jsonlPreserve workspace
Section titled “Preserve workspace”# Keep agent's workspace for inspectionvally eval \ --eval-spec eval.yaml \ --workspace ./debug-workspace \ --verboseTrajectory output
Section titled “Trajectory output”Output is saved in a timestamped subdirectory under the output directory (default: ./vally-results/).
Single run or JSONL mode:
results/└── 2025-01-15T10-30-00-005Z/ ├── results.jsonl ├── eval-results.md └── executor-session-logs/ └── <eval-name>/<model>/<stimulus-name>/trial-0/ ├── metadata.json └── events.jsonl (best-effort, may be absent)The executor-session-logs/ directory contains per-run session logs. metadata.json is always written and includes a logSource field ("native", "raw-event-fallback", or "none") indicating what was captured. events.jsonl is best-effort — it is present when the executor emits native session state or when raw SDK events were captured as a fallback, but may be absent for executors that don’t emit events.
Multi-trial (--runs K where K ≥ 2):
results/└── 2025-01-15T10-30-00-005Z/ ├── results.jsonl ├── eval-results.md └── executor-session-logs/ └── <eval-name>/<model>/<stimulus-name>/ ├── trial-0/ │ ├── events.jsonl │ └── metadata.json └── trial-1/ ├── events.jsonl └── metadata.jsonMulti-trial mode
Section titled “Multi-trial mode”When --runs is ≥ 2 (or config.runs is ≥ 2 in the eval spec), the eval command runs each stimulus multiple times and aggregates results:
vally eval --eval-spec eval.yaml --runs 5 --verboseOutput includes per-trial progress and aggregate metrics:
━━━ basic-test-generation (5 trials) ━━━ Write unit tests for this function: function add(a, b)...
Trial 1 ✔ 8.3s Trial 2 ✔ 7.1s Trial 3 ✘ 9.2s Trial 4 ✔ 6.8s Trial 5 ✔ 8.0s
pass@5: 99.2% pass^5: 41.0% pass rate: 4/5 (80.0%) ⚠ flaky (20.0% minority outcomes)Trials within a stimulus run sequentially. Parallelism across different stimuli is controlled by --workers.
Three execution modes
Section titled “Three execution modes”runs value | Mode | Behavior |
|---|---|---|
0 | Lint-only | Skip execution entirely (print “Skipped”) |
1 | Single run | Execute once, grade, report (original behavior) |
≥ 2 | Multi-trial | Run K times, grade each, aggregate with pass@k/pass^k |
Multi-model mode
Section titled “Multi-model mode”Pass a comma-separated list of models to run every eval against each model independently:
vally eval --eval-spec eval.yaml --model gpt-5.5,claude-opus-4.8Each eval is executed once per model. With 3 evals and 2 models, that’s 6 total runs. Results are reported separately per model — the console shows which model each eval ran against:
✔ my-eval [gpt-5.5] all graders passed ✘ my-eval [claude-opus-4.8] grader(s) failedThe exit code is 0 only when all model variants pass.
Workspace preservation
Section titled “Workspace preservation”--workspace <path> cannot be combined with multiple models in a single command. The workspace path is keyed by (variant, stimulus) and does not include the model, so two models running the same eval would both claim the same workspace directory, which the planner rejects as a collision:
Workspace path collision at '<workspace>/main/<stimulus>': claimed by'... × gpt-5.5' and '... × claude-opus-4.8'. Rename one of the stimulior use distinct --workspace roots.If you need preserved workspaces with multi-model, run each model separately with its own --workspace root:
vally eval --eval-spec eval.yaml --model gpt-5.5 --workspace ./ws/gpt-5.5vally eval --eval-spec eval.yaml --model claude-opus-4.8 --workspace ./ws/claude-opus-4.8Combining with other options
Section titled “Combining with other options”Multi-model works with --runs, --suite, --tag, and --workers.
# 2 models × 5 trials per stimulusvally eval --eval-spec eval.yaml \ --model gpt-5.5,claude-opus-4.8 \ --runs 5
# Suite + multi-modelvally eval --suite regression --model gpt-5.5,claude-opus-4.8