CLI: experiment

Usage

vally experiment run <experiment-file> [options]
vally experiment merge <shard-dir>... -o <merged-dir>

Description

Resolve and execute an experiment file — a declarative spec that runs the same eval set across multiple variants (e.g. different models, skill sets, MCP server configurations) for controlled A/B comparison.

Every (eval × variant) combination flows through a single shared worker pool, so one slow variant does not block the others. Each variant gets its own output directory containing results.jsonl (one trial-result record per trial, with an experiment provenance block) and run-summary.jsonl. An unsharded run also writes a single experiment-level markdown report (report.md) alongside the per-variant subdirectories; a sharded run defers the report to vally experiment merge.

See Experiment File Schema for the YAML format.

Options

Flag	Type	Default	Description
`<experiment-file>`	path	—	Required positional argument. Path to the experiment YAML file.
`--variant <name>`	string	—	Run only the named variant. Useful for CI splits or partial reruns.
`--eval-filter <pattern>`	path/glob	—	Run only declared evals matching this path/glob. Repeatable. Narrows (intersects) the file’s `evals:` set — it can never add an eval the file did not declare. Each pattern must match at least one declared eval, or the run fails fast (so a stale filter can’t silently shrink a CI run).
`--output-dir <path>`	path	`./vally-experiment-results`	Directory for output files. A timestamped subdirectory is created inside.
`--workers <n>`	integer	`5`	Max concurrent trials across all variants.
`--backend <name-or-package>`	string	`local`	Backend that controls where trials run: a built-in name such as `local`, or a backend plugin package/path that registers exactly one backend.
`--backend-args <key=value>`	string	—	Argument passed to the selected backend. Repeatable; repeated keys are collected as arrays. Requires `--backend`.
`--shard <i/n>`	string	—	Run only shard i of n (e.g. `1/4`). Requires `--run-id`.
`--run-id <id>`	string	—	Stable run identifier shared across all shards of a run (a UUID is recommended). Required with `--shard`.
`--shard-strategy <specifier>`	string	`round-robin`	Shard selection strategy: a built-in (`round-robin`, `by-stimulus`) or a module specifier.
`--param <key=value>`	string	—	Set a param value (repeatable, e.g. `--param MODEL=gpt-4o`). Overrides eval param files and `.vally.yaml`. See `vally eval` — Parameter resolution for the full precedence chain.
`--dry-run`	flag	`false`	Resolve the experiment and print the plan without executing.
`--verbose`	flag	`false`	Increase console logging verbosity and show full agent output.

Exit codes

run:

Code	Meaning
`0`	All variants completed and every trial passed.
`1`	Any trial failed, the experiment file did not resolve, or `--variant` named an unknown variant.

merge:

Code	Meaning
`0`	The shards reconciled cleanly and every variant passed.
`1`	A validation check failed (missing/duplicate shard, mismatched run identity, incomplete partition, unfinished shard), or any variant failed in the merged result.
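
Because both subcommands signal failure only through the exit status, a CI step can gate on it directly. The following is a minimal sketch, not part of vally: `gate` is a hypothetical helper name that runs whatever command it is given and reports the outcome from its exit status.

```shell
# gate CMD...: hypothetical wrapper (not a vally feature).
# Runs the given command; prints a pass message on exit 0, otherwise
# reports the non-zero status on stderr and propagates it.
gate() {
  if "$@"; then
    echo "experiment passed"
  else
    status=$?   # exit status of the command that was just run
    echo "experiment failed (exit $status)" >&2
    return "$status"
  fi
}
```

It might be used as `gate vally experiment run experiments/skill-comparison.yaml`, so that a non-zero exit fails the surrounding job.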

Output layout

<output-dir>/
└── 2026-06-09T20-41-10-885Z/         # timestamped run directory
    ├── report.md                      # experiment-level markdown report
    ├── <variant-a>/                    # one subdirectory per variant (named after the variant)
    │   ├── results.jsonl              # trial-result records + experiment block
    │   ├── run-summary.jsonl
    │   └── <eval>/<stimulus>/<model>/<trial>/   # preserved per-trial session logs
    │       ├── events.jsonl           # best-effort (may be absent)
    │       ├── metadata.json          # always written
    │       └── artifacts/             # copied stimulus artifacts (if configured)
    └── <variant-b>/
        ├── results.jsonl
        ├── run-summary.jsonl
        └── <eval>/<stimulus>/<model>/<trial>/
            └── metadata.json

Each results.jsonl line is a TrialResultRecord (the same shape vally eval writes) with an extra experiment field carrying the run ID, variant name, and content hashes of the eval and resolved config.
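
Since each results.jsonl line is one trial record, a quick sanity check after a run is to count lines per variant. The sketch below relies only on the directory layout shown above and does not parse the records; `count_trials` is a hypothetical helper name.

```shell
# count_trials RUN_DIR: hypothetical helper (not a vally feature).
# Prints "<variant> <trial-count>" for every variant subdirectory of a run.
count_trials() {
  run_dir=$1
  for results in "$run_dir"/*/results.jsonl; do
    [ -f "$results" ] || continue   # skip when the glob matched nothing
    variant=$(basename "$(dirname "$results")")
    # grep -c '' counts lines, including a last line with no trailing newline
    printf '%s %s\n' "$variant" "$(grep -c '' "$results")"
  done
}
```

For example, `count_trials ./vally-experiment-results/<timestamp>` lists one line per variant. Unequal counts between variants would suggest that a variant did not run the full eval set.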

Like vally eval, each trial’s executor session log is preserved. Because one experiment run spans multiple variants, the tree is partitioned by variant first: each variant directory holds its results.jsonl and run-summary.jsonl alongside the flattened per-trial session directories (<variant>/<eval>/<stimulus>/<model>/<trial>/). The metadata.json and events.jsonl semantics match vally eval, and configured stimulus artifacts are copied into an artifacts/ subdirectory of the same trial directory. In a sharded run the logs live inside that shard’s directory (<run-id>/shard-<i>-of-<n>/<variant>/...); vally experiment merge leaves them in place per shard rather than consolidating them.

Examples

# Run every variant
vally experiment run experiments/skill-comparison.yaml

# Run a single variant (e.g. for a CI matrix split)
vally experiment run experiments/skill-comparison.yaml --variant treatment

# Run only a subset of the declared evals (e.g. one plugin's evals per CI shard).
# The pattern is intersected with the file's evals: set, so the file's own
# exclusions still apply. A pattern matching no declared eval fails fast.
vally experiment run experiments/skill-comparison.yaml \
  --eval-filter 'tests/my-plugin/**/eval.vally.yaml'

# Run a single eval (targeted local run); repeat --eval-filter to add more.
vally experiment run experiments/skill-comparison.yaml \
  --eval-filter tests/my-plugin/some-eval/eval.vally.yaml

# Print the resolved plan without executing
vally experiment run experiments/skill-comparison.yaml --dry-run

# Custom output location
vally experiment run experiments/skill-comparison.yaml \
  --output-dir ./runs/skill-comparison

# Run a single cell of a matrix experiment (names are generated as axis=label,…)
vally experiment run experiments/model-x-skills.yaml \
  --variant "model=gpt-5.5,skills=none"

Sharded runs and merging

For large experiments you can split the work across N machines with --shard i/n. Each shard plans the whole run but executes only its slice of the trials, writing its output under a stable <run-id>/shard-<i>-of-<n>/ layout (the index and total are zero-padded to equal width, so the two-digit form only appears once n >= 10):

<output-dir>/
└── <run-id>/                         # stable, shared across all shards
    ├── shard-1-of-3/
    │   ├── plan-snapshot.json        # planned evals/stimuli/graders (machine-independent)
    │   ├── shard-manifest.json       # run + shard identity (planDigest, selected/completed keys)
    │   ├── <variant-a>/results.jsonl  # only this shard's trials
    │   └── <variant-b>/results.jsonl
    ├── shard-2-of-3/
    └── shard-3-of-3/
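
The zero-padding rule can be sketched as a small shell function. This is an illustration of the naming described above, not vally's own code: `shard_dir_name` is a hypothetical helper, and it assumes the pad width equals the number of digits in n.

```shell
# shard_dir_name I N: hypothetical helper (not a vally feature).
# Prints the shard directory name for shard I of N, zero-padding the index
# to the width of N (assumed from the padding rule described above).
shard_dir_name() {
  i=$1
  n=$2
  width=${#n}   # number of digits in the total
  printf "shard-%0${width}d-of-%d\n" "$i" "$n"
}
```

Under that assumption, `shard_dir_name 1 3` prints `shard-1-of-3` and `shard_dir_name 1 12` prints `shard-01-of-12`.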

A sharded run writes no report.md — a single shard’s slice is incomplete, and its per-variant pass/fail is partial. Run vally experiment merge to reconcile the shards into one authoritative result.

vally experiment merge <shard-dir>... -o <merged-dir>

Merge reads every shard’s shard-manifest.json and plan-snapshot.json, validates that they describe the same run and partition it exactly, then concatenates the per-variant results.jsonl, regenerates the run summaries, and renders the final report.md. It is pure data wrangling — no agents are executed and no grading is performed (each shard graded its own trials during run). The merged report reproduces an unsharded run of the same spec apart from the render-time timestamp (and any diagnostics emitted during execution, which aren’t persisted per shard — see below).

Shard directories are passed explicitly — there is no auto-discovery, so a partial merge can’t happen by accident. Merge fails (exit 1) with a specific message if a shard is missing or duplicated, the shards disagree on run identity, a shard did not finish, or their slices don’t cover the planned run exactly.
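
Since merge does no auto-discovery, a final CI job may want to confirm that every shard directory arrived before calling it. The following is a rough pre-flight sketch, not a replacement for merge's own validation: `missing_shards` is a hypothetical helper, it assumes 1-based shard indices and the zero-padded naming described earlier, and it checks only that the two files merge reads are present.

```shell
# missing_shards RUN_DIR N: hypothetical pre-flight check (not a vally feature).
# Prints every expected shard directory that is absent or lacks
# shard-manifest.json / plan-snapshot.json; returns 1 if any were printed.
missing_shards() {
  run_dir=$1
  n=$2
  width=${#n}   # assumed pad width: number of digits in N
  missing=0
  i=1
  while [ "$i" -le "$n" ]; do
    dir=$(printf "%s/shard-%0${width}d-of-%d" "$run_dir" "$i" "$n")
    if [ ! -f "$dir/shard-manifest.json" ] || [ ! -f "$dir/plan-snapshot.json" ]; then
      echo "$dir"
      missing=1
    fi
    i=$((i + 1))
  done
  return "$missing"
}
```

A job could run `missing_shards "./out/$RUN_ID" 3` and only proceed to `vally experiment merge` when it succeeds. Merge still performs the authoritative checks (run identity, completion, exact partition coverage).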

`merge` options

Flag	Type	Default	Description
`<shard-dir>...`	path	—	Required. One or more shard output directories of a single run.
`-o, --output-dir`	path	—	Required. Destination directory for the merged artifacts.

Merged output layout

<merged-dir>/
├── report.md                  # authoritative experiment report
├── plan-snapshot.json
├── experiment-manifest.json   # promoted run identity (for a future `experiment compare`)
├── <variant-a>/
│   ├── results.jsonl          # all shards' trials, concatenated
│   └── run-summary.jsonl
└── <variant-b>/
    ├── results.jsonl
    └── run-summary.jsonl

Any other per-shard output (e.g. preserved workspaces or executor session logs) is left in place per-shard and is not consolidated into the merged directory.

Sharded example (CI matrix)

# Each CI job runs one shard, sharing a stable run id:
vally experiment run experiments/skill-comparison.yaml \
  --shard 1/3 --run-id "$RUN_ID" --output-dir ./out
# ... shards 2/3 and 3/3 run on other machines ...

# A final job merges the collected shard directories:
vally experiment merge \
  ./out/$RUN_ID/shard-1-of-3 \
  ./out/$RUN_ID/shard-2-of-3 \
  ./out/$RUN_ID/shard-3-of-3 \
  -o ./out/merged

If a shard is lost, re-run just that shard with the same --run-id and merge again. Sharding is deterministic, so the re-run reproduces the same slice.

Workflow

1. Author an experiment YAML file declaring the eval set, the vary axis, and the named variants, or a matrix of axes that expands into them automatically. See Experiment File Schema.
2. Run vally experiment run <file> (optionally with --dry-run first to inspect the plan). To parallelize across machines, run each shard with --shard i/n --run-id <id>, then reconcile them with vally experiment merge.
3. Inspect per-variant results.jsonl directly, ingest the run directory with vally ingest, or browse it via vally serve.