Skip to content

CLI: eval

Terminal window
vally eval [options]

Run stimuli against an agent and grade the results. Stimuli come from one of three sources:

  1. --eval-spec / -e — load stimuli from one or more eval spec files.
  2. --suite — run a named suite of evals defined in .vally.yaml.
  3. --tag — discover and filter eval stimuli by tag.
FlagTypeDefaultDescription
--eval-spec, -e <path>stringPath to eval spec file (repeatable)
--skill-dir <path>stringOpt-in: discover and load skills from this directory (same detection as vally lint <path>) for the whole run. A missing/non-directory path fails the command. Omit it to load only skills specified in environment.skills.
--work-dir <path>stringcwdWorking directory for agent runs
--model <model>stringFrom eval specModel to use for agent execution. Comma-separated list runs evals against each model (e.g., --model gpt-5.5,claude-opus-4.8).
--timeout <duration>duration2mTimeout per run (e.g. 5m, 300s, 30000ms). Unit suffix required. Falls back to config.timeout from the eval spec, then to 2m.
--threshold <number>numberFrom eval specOverride the eval scoring threshold for this command invocation. Must be between 0 and 1. When no threshold is configured, evals use the binary grader pass/fail verdict.
--judge-model <model>stringFrom eval specModel for LLM judge graders (overrides config.judge_model in the eval spec)
--grader-plugin <specifier>stringGrader plugin to load (npm package name or local path). Registers external graders before validation and execution.
--executor-plugin <specifier>stringExecutor plugin to load (npm package name or local path). Repeatable — pass multiple flags when running eval specs that select different executors via their config.executor field.
--eval-plugin <specifier>stringEval provider plugin to load (npm package name or local path). Requires --eval-spec; cannot be combined with --suite.
--output-dir <path>string./vally-resultsParent directory for eval outputs. Each run is written to a timestamped subdirectory such as <output-dir>/<timestamp>/, containing files like trajectories and eval outcomes.
--output <format>jsonlOutput format. jsonl emits one JSON object per stimulus to stdout.
--junitbooleanfalseWrite a JUnit XML report (eval-results.junit.xml) to the run’s output directory.
--workspace <path>stringWorkspace directory for the agent (preserved after run).
--runs <n>numberFrom eval specNumber of trials per stimulus (overrides config.runs). ≥ 2 enables multi-trial mode with pass@k metrics.
--max-retries <n>number2Max retries per stimulus on retryable errors (timeouts, rate limits, and transient connection failures). Each retry uses a fresh workspace with exponential backoff. Set to 0 to disable.
--skip-gradebooleanfalseRun agents without grading (grade later with the grade command)
--skip-validatebooleanfalseSkip eval spec validation (validation runs automatically by default)
--workers <n>number5Number of concurrent stimulus sessions
--shutdown-timeout <duration>duration5sMax time to wait for executors to shut down during cleanup, including on Ctrl+C (e.g. 5s, 500ms).
--suite <name>stringRun a named suite of evals from .vally.yaml. Cannot be combined with --eval-spec.
--tag <key=values>stringFilter stimuli by tag (repeatable, e.g., --tag cost=free,low).
--verbosebooleanfalseIncrease logging verbosity and show full agent output
CodeMeaning
0All eval verdicts passed, or verdict failures were suppressed by an override such as --threshold 0
1One or more eval verdicts failed, or an execution/tooling error occurred

When a threshold is configured via --threshold or scoring.threshold, the eval verdict is based on the aggregate score meeting that threshold. If no threshold is configured, the command uses the binary grader pass/fail verdict. Execution and tooling errors always exit with code 1, even when --threshold 0 is used.

A run is skill-free by default. Skills load only when you opt in through one of two explicit mechanisms:

  • environment.skills (recommended) — list skill directories in an eval spec’s environment block. A top-level block applies to all stimuli in the spec; an individual stimulus can declare its own. Paths resolve relative to the eval file. Each listed directory is copied into the agent’s isolated workspace, so the agent sees exactly the declared skills and nothing else. A missing directory, a directory without a SKILL.md, or an unparseable SKILL.md fails the run fast rather than silently loading nothing.
  • --skill-dir <path> (whole run, opt-in) — discover and load skills from <path> (the same detection vally lint <path> uses) for all stimuli in the run. Useful for ad-hoc “load this folder of skills” runs. A missing or non-directory path fails the command.

The two mechanisms don’t stack: when environment.skills is declared (in an eval spec, applying to all its stimuli, or on an individual stimulus), that list is the complete skill set and the --skill-dir base is ignored (replace, not merge). Stimuli with no environment.skills fall back to the --skill-dir set.

Terminal window
# Run all stimuli from eval.yaml
vally eval \
--eval-spec eval.yaml \
--skill-dir ./skills/test-writer \
--output-dir ./results \
--model gpt-5.5
# Run multiple eval spec files
vally eval -e auth.eval.yaml -e perf.eval.yaml \
--skill-dir ./skills/test-writer
# Compare models side-by-side
vally eval \
--eval-spec eval.yaml \
--model gpt-5.5,claude-opus-4.8 \
--output-dir ./results
# Keep workflow steps running for poor eval performance, but still fail on execution errors
vally eval \
--eval-spec eval.yaml \
--threshold 0
Terminal window
# Capture trajectories now, grade later
vally eval \
--eval-spec eval.yaml \
--skip-grade \
--output jsonl > outcomes.jsonl
# Later: grade with updated graders
vally grade --eval-spec eval.yaml < outcomes.jsonl
Terminal window
# Keep agent's workspace for inspection
vally eval \
--eval-spec eval.yaml \
--workspace ./debug-workspace \
--verbose

Output is saved in a timestamped subdirectory under the output directory (default: ./vally-results/).

Single run or JSONL mode:

results/
└── 2025-01-15T10-30-00-005Z/
├── results.jsonl
├── eval-results.md
└── executor-session-logs/
└── <eval-name>/<model>/<stimulus-name>/trial-0/
├── metadata.json
└── events.jsonl (best-effort, may be absent)

The executor-session-logs/ directory contains per-run session logs. metadata.json is always written and includes a logSource field ("native", "raw-event-fallback", or "none") indicating what was captured. events.jsonl is best-effort — it is present when the executor emits native session state or when raw SDK events were captured as a fallback, but may be absent for executors that don’t emit events.

Multi-trial (--runs K where K ≥ 2):

results/
└── 2025-01-15T10-30-00-005Z/
├── results.jsonl
├── eval-results.md
└── executor-session-logs/
└── <eval-name>/<model>/<stimulus-name>/
├── trial-0/
│ ├── events.jsonl
│ └── metadata.json
└── trial-1/
├── events.jsonl
└── metadata.json

When --runs is ≥ 2 (or config.runs is ≥ 2 in the eval spec), the eval command runs each stimulus multiple times and aggregates results:

Terminal window
vally eval --eval-spec eval.yaml --runs 5 --verbose

Output includes per-trial progress and aggregate metrics:

━━━ basic-test-generation (5 trials) ━━━
Write unit tests for this function: function add(a, b)...
Trial 1 ✔ 8.3s
Trial 2 ✔ 7.1s
Trial 3 ✘ 9.2s
Trial 4 ✔ 6.8s
Trial 5 ✔ 8.0s
pass@5: 99.2% pass^5: 41.0% pass rate: 4/5 (80.0%)
⚠ flaky (20.0% minority outcomes)

Trials within a stimulus run sequentially. Parallelism across different stimuli is controlled by --workers.

runs valueModeBehavior
0Lint-onlySkip execution entirely (print “Skipped”)
1Single runExecute once, grade, report (original behavior)
≥ 2Multi-trialRun K times, grade each, aggregate with pass@k/pass^k

Pass a comma-separated list of models to run every eval against each model independently:

Terminal window
vally eval --eval-spec eval.yaml --model gpt-5.5,claude-opus-4.8

Each eval is executed once per model. With 3 evals and 2 models, that’s 6 total runs. Results are reported separately per model — the console shows which model each eval ran against:

✔ my-eval [gpt-5.5] all graders passed
✘ my-eval [claude-opus-4.8] grader(s) failed

The exit code is 0 only when all model variants pass.

--workspace <path> cannot be combined with multiple models in a single command. The workspace path is keyed by (variant, stimulus) and does not include the model, so two models running the same eval would both claim the same workspace directory, which the planner rejects as a collision:

Workspace path collision at '<workspace>/main/<stimulus>': claimed by
'... × gpt-5.5' and '... × claude-opus-4.8'. Rename one of the stimuli
or use distinct --workspace roots.

If you need preserved workspaces with multi-model, run each model separately with its own --workspace root:

Terminal window
vally eval --eval-spec eval.yaml --model gpt-5.5 --workspace ./ws/gpt-5.5
vally eval --eval-spec eval.yaml --model claude-opus-4.8 --workspace ./ws/claude-opus-4.8

Multi-model works with --runs, --suite, --tag, and --workers.

Terminal window
# 2 models × 5 trials per stimulus
vally eval --eval-spec eval.yaml \
--model gpt-5.5,claude-opus-4.8 \
--runs 5
# Suite + multi-model
vally eval --suite regression --model gpt-5.5,claude-opus-4.8