CLI: eval

Usage

vally eval [options]

Description

Run stimuli against an agent and grade the results. Stimuli come from one of three sources:

--eval-spec / -e — load stimuli from one or more eval spec files.
--suite — run a named suite of evals defined in .vally.yaml.
--tag — discover and filter eval stimuli by tag.

Options

Flag	Type	Default	Description
`--eval-spec, -e <path>`	string	—	Path to eval spec file (repeatable)
`--skill-dir <path>`	string	—	Opt-in: discover and load skills from this directory (same detection as `vally lint <path>`) for the whole run. A missing/non-directory path fails the command. Omit it to load only skills specified in `environment.skills`.
`--work-dir <path>`	string	cwd	Working directory for agent runs
`--model <model>`	string	From eval spec	Model to use for agent execution. Comma-separated list runs evals against each model (e.g., `--model gpt-5.5,claude-opus-4.8`).
`--timeout <duration>`	duration	`2m`	Timeout per run (e.g. `5m`, `300s`, `30000ms`). Unit suffix required. Falls back to `defaults.timeout` from the eval spec, then to 2m.
`--threshold <number>`	number	From eval spec	Override the eval scoring threshold for this command invocation. Must be between `0` and `1`. When no threshold is configured, evals use the binary grader pass/fail verdict.
`--judge-model <model>`	string	From eval spec	Model for LLM judge graders (overrides `defaults.judge_model` in the eval spec)
`--judge-reasoning-effort <level>`	string	From eval spec	Reasoning effort for the judge model: `low`, `medium`, `high`, or `xhigh` (overrides `defaults.judge_reasoning_effort` in the eval spec)
`--grader-plugin <specifier>`	string	—	Grader plugin to load (npm package name or local path). Registers external graders before validation and execution.
`--executor-plugin <specifier>`	string	—	Executor plugin to load (npm package name or local path). Repeatable — pass multiple flags when running eval specs that select different executors via their `defaults.executor` field.
`--backend <name-or-package>`	string	`local`	Backend that controls where trials run: a built-in name such as `local`, or a backend plugin package/path that registers exactly one backend.
`--backend-args <key=value>`	string	—	Argument passed to the selected backend. Repeatable; repeated keys are collected as arrays. Requires `--backend`.
`--eval-plugin <specifier>`	string	—	Eval provider plugin to load (npm package name or local path). Requires `--eval-spec`; cannot be combined with `--suite`.
`--otlp-endpoint <url>`	string	—	Export Copilot SDK OpenTelemetry traces, including Vally trial parent spans, to an OTLP/HTTP collector. Must be an `http(s)` URL with no embedded credentials (supply auth out-of-band).
`--output-dir <path>`	string	`./vally-results`	Parent directory for eval outputs. Each run is written to a timestamped subdirectory such as `<output-dir>/<timestamp>/`, containing files like trajectories and eval outcomes.
`--output <format>`	`jsonl`	—	Output format. `jsonl` emits one JSON object per stimulus to stdout.
`--junit`	boolean	`false`	Write a JUnit XML report (`eval-results.junit.xml`) to the run’s output directory.
`--workspace <path>`	string	—	Workspace directory for the agent (preserved after run).
`--runs <n>`	number	From eval spec	Number of trials per stimulus (overrides `defaults.runs`). ≥ 2 enables multi-trial mode with pass@k metrics.
`--max-retries <n>`	number	`2`	Max retries per stimulus on retryable errors (timeouts, rate limits, and transient connection failures). Each retry uses a fresh workspace with exponential backoff. Set to 0 to disable.
`--skip-grade`	boolean	`false`	Run agents without grading (grade later with the `grade` command)
`--skip-validate`	boolean	`false`	Skip eval spec validation (validation runs automatically by default)
`--workers <n>`	number	`5`	Number of concurrent stimulus sessions
`--shutdown-timeout <duration>`	duration	`20s`	Max time to wait for executors to shut down during cleanup, including on Ctrl+C (e.g. `5s`, `500ms`).
`--suite <name>`	string	—	Run a named suite of evals from `.vally.yaml`. Cannot be combined with `--eval-spec`.
`--tag <key=values>`	string	—	Filter stimuli by tag (repeatable, e.g., `--tag cost=free,low`).
`--param <key=value>`	string	—	Set a param value (repeatable, e.g. `--param MODEL=gpt-4o`). Overrides eval param files and `.vally.yaml`.
`--verbose`	boolean	`false`	Increase logging verbosity and show full agent output

Exit codes

Code	Meaning
`0`	All eval verdicts passed, or verdict failures were suppressed by an override such as `--threshold 0`
`1`	One or more eval verdicts failed, or an execution/tooling error occurred

When a threshold is configured via --threshold or scoring.threshold, the eval verdict is based on the aggregate score meeting that threshold. If no threshold is configured, the command uses the binary grader pass/fail verdict. Execution and tooling errors always exit with code 1, even when --threshold 0 is used.

Parameter resolution

When an eval file contains ${KEY} placeholders, values are resolved from multiple sources in this precedence order (highest wins):

CLI — --param KEY=value flags
Param file — eval.params.yaml (eval-level) or .vally.yaml params: (suite-level)
Inline defaults — ${KEY=fallback} in the YAML itself

To include a literal ${...} in your eval (e.g., a bash variable or another template syntax), use the double-dollar escape: $${HOME} produces ${HOME} in the output without being treated as a param placeholder.

Skills

A run is skill-free by default. Skills load only when you opt in through one of two explicit mechanisms:

environment.skills (recommended) — list skill directories in an eval spec’s environment block. A top-level block applies to all stimuli in the spec; an individual stimulus can declare its own. Paths resolve relative to the eval file. Each listed directory is copied into the agent’s isolated workspace, so the agent sees exactly the declared skills and nothing else. A missing directory, a directory without a SKILL.md, or an unparseable SKILL.md fails the run fast rather than silently loading nothing.
--skill-dir <path> (whole run, opt-in) — discover and load skills from <path> (the same detection vally lint <path> uses) for all stimuli in the run. Useful for ad-hoc “load this folder of skills” runs. A missing or non-directory path fails the command.

The two mechanisms don’t stack: when environment.skills is declared (in an eval spec, applying to all its stimuli, or on an individual stimulus), that list is the complete skill set and the --skill-dir base is ignored (replace, not merge). Stimuli with no environment.skills fall back to the --skill-dir set.

Examples

With eval spec

# Run all stimuli from eval.yaml
vally eval \
  --eval-spec eval.yaml \
  --skill-dir ./skills/test-writer \
  --output-dir ./results \
  --model gpt-5.5

# Run multiple eval spec files
vally eval -e auth.eval.yaml -e perf.eval.yaml \
  --skill-dir ./skills/test-writer

# Compare models side-by-side
vally eval \
  --eval-spec eval.yaml \
  --model gpt-5.5,claude-opus-4.8 \
  --output-dir ./results

# Keep workflow steps running for poor eval performance, but still fail on execution errors
vally eval \
  --eval-spec eval.yaml \
  --threshold 0

Capture without grading

# Capture trajectories now, grade later
vally eval \
  --eval-spec eval.yaml \
  --skip-grade \
  --output jsonl > outcomes.jsonl

# Later: grade with updated graders
vally grade --eval-spec eval.yaml < outcomes.jsonl

Export OpenTelemetry to an OTLP collector

vally eval \
  --eval-spec eval.yaml \
  --otlp-endpoint http://localhost:4318

In OTLP mode, Vally trial/attempt spans and Copilot SDK runtime spans are exported to the same collector. Each logical trial is one trace: a vally.trial parent span with one child vally.attempt span per attempt (executor runtime spans nest under the attempt). Filter traces by the vally.trial span and its vally.* attributes to inspect a specific trial. OTLP mode does not write local span files.

Preserve workspace

# Keep agent's workspace for inspection
vally eval \
  --eval-spec eval.yaml \
  --workspace ./debug-workspace \
  --verbose

Trajectory output

Output is saved in a timestamped subdirectory under the output directory (default: ./vally-results/).

Single run or JSONL mode:

results/
└── 2025-01-15T10-30-00-005Z/
    ├── results.jsonl
    ├── eval-results.md
    ├── otel-spans.jsonl       (absent when --otlp-endpoint specified)
    └── <eval-name>/<stimulus-name>/<model>/0/
        ├── metadata.json
        ├── events.jsonl   (best-effort, may be absent)
        └── artifacts/     (present only when the stimulus configures `artifacts`)

Session logs are written per (eval, stimulus, model, trial) under the run directory in <eval>/<stimulus>/<model>/<trial>/ directories. metadata.json is always written and includes a logSource field ("native", "raw-event-fallback", or "none") indicating what was captured. events.jsonl is best-effort — it is present when the executor emits native session state or when raw SDK events were captured as a fallback, but may be absent for executors that don’t emit events. When a stimulus configures artifacts, the copied files are placed in an artifacts/ subdirectory of that same trial directory.

Multi-trial (--runs K where K ≥ 2):

results/
└── 2025-01-15T10-30-00-005Z/
    ├── results.jsonl
    ├── eval-results.md
    ├── otel-spans.jsonl       (absent when --otlp-endpoint specified)
    └── <eval-name>/<stimulus-name>/<model>/
        ├── 0/
        │   ├── events.jsonl
        │   ├── metadata.json
        │   └── artifacts/   (present only when the stimulus configures `artifacts`)
        └── 1/
            ├── events.jsonl
            ├── metadata.json
            └── artifacts/

Multi-trial mode

When --runs is ≥ 2 (or defaults.runs is ≥ 2 in the eval spec), the eval command runs each stimulus multiple times and aggregates results:

vally eval --eval-spec eval.yaml --runs 5 --verbose

Output includes per-trial progress and aggregate metrics:

━━━ basic-test-generation (5 trials) ━━━
  Write unit tests for this function: function add(a, b)...

  Trial 1  ✔ 8.3s
  Trial 2  ✔ 7.1s
  Trial 3  ✘ 9.2s
  Trial 4  ✔ 6.8s
  Trial 5  ✔ 8.0s

  pass@5: 99.2%  pass^5: 41.0%  pass rate: 4/5 (80.0%)
  ⚠ flaky (20.0% minority outcomes)

Trials within a stimulus run sequentially. Parallelism across different stimuli is controlled by --workers.

Three execution modes

`runs` value	Mode	Behavior
`0`	Lint-only	Skip execution entirely (print “Skipped”)
`1`	Single run	Execute once, grade, report (original behavior)
`≥ 2`	Multi-trial	Run K times, grade each, aggregate with pass@k/pass^k

Multi-model mode

Pass a comma-separated list of models to run every eval against each model independently:

vally eval --eval-spec eval.yaml --model gpt-5.5,claude-opus-4.8

Each eval is executed once per model. With 3 evals and 2 models, that’s 6 total runs. Results are reported separately per model — the console shows which model each eval ran against:

  ✔ my-eval [gpt-5.5]  all graders passed
  ✘ my-eval [claude-opus-4.8]  grader(s) failed

The exit code is 0 only when all model variants pass.

Workspace preservation

--workspace <path> cannot be combined with multiple models in a single command. The workspace path is keyed by (variant, stimulus) and does not include the model, so two models running the same eval would both claim the same workspace directory, which the planner rejects as a collision:

Workspace path collision at '<workspace>/main/<stimulus>': claimed by
'... × gpt-5.5' and '... × claude-opus-4.8'. Rename one of the stimuli
or use distinct --workspace roots.

If you need preserved workspaces with multi-model, run each model separately with its own --workspace root:

vally eval --eval-spec eval.yaml --model gpt-5.5 --workspace ./ws/gpt-5.5
vally eval --eval-spec eval.yaml --model claude-opus-4.8 --workspace ./ws/claude-opus-4.8

Combining with other options

Multi-model works with --runs, --suite, --tag, and --workers.

# 2 models × 5 trials per stimulus
vally eval --eval-spec eval.yaml \
  --model gpt-5.5,claude-opus-4.8 \
  --runs 5

# Suite + multi-model
vally eval --suite regression --model gpt-5.5,claude-opus-4.8