Experiment File Schema

An experiment file declares a controlled comparison of multiple eval runs that differ along a declared axis. The CLI reads it with vally experiment run and produces one output directory per variant.

Everything outside the vary paths is required to be identical across variants; differences elsewhere in the resolved config are a hard error at plan time.

Minimal example

name: model-comparison
evals:
  - evals/auth/eval.yaml
  - evals/search/eval.yaml

vary: [/defaults/model]

baseline: gpt-5.5
variants:
  gpt-5.5:
    overrides:
      model: gpt-5.5
  claude-sonnet-4.6:
    overrides:
      model: claude-sonnet-4.6

Top-level fields

Field	Type	Required	Description
`name`	string	Yes	Human-readable experiment name. Surfaces in reports, JSONL records, and the analytics server.
`evals`	string[]	Yes	Eval spec files to include. Supports globs. Paths resolve relative to the experiment file’s directory. Individual patterns that match zero files are skipped silently; an error is raised only when the combined patterns match zero files. This is the source of truth for the eval set — the CLI’s `--eval-filter` flag can only narrow it (intersection), never add evals it does not declare.
`vary`	string[]	Yes²	JSON Pointer paths that may differ between variants. Subtree prefix semantics: `/environment/skills` authorizes any change at or below that path. Differences outside `vary` are rejected as drift. Required with `variants`; omit with `matrix` — in matrix mode it is derived from the axis paths (and rejected if present).
`baseline`	string \| map	Yes	The control variant. With `variants`, a string naming a key in `variants`. With `matrix`, a map of axis name → value label selecting one generated cell. At least two variants are required. Currently used to validate the experiment file and flag the baseline plan in dry-run output; comparison reporting that consumes this field is not yet wired.
`variants`	map<string, VariantOverride>	Yes¹	Named variants. Each entry’s overrides merge on top of the eval spec’s defaults and environment before per-stimulus environment resolution, so variant changes can influence what stimuli see.
`matrix`	map<string, Axis>	Yes¹	Declarative cross-product of named axes that expands into `variants`. Mutually exclusive with `variants`. See Matrix (cross-product) variants.
`overrides`	EvalDefaults	No	Experiment-wide execution overrides applied to every variant. Precedence: CLI flags > experiment overrides > eval defaults.
`execution`	object	No	Runner mechanics (excluded from the config hash). Currently only `workers: <integer>` is honored; `--workers` on the CLI takes precedence.
`filter`	map	No	Stimulus tag filter applied consistently to every variant. Not yet wired into `vally experiment run` — parsed but currently inert.
`grader_plugins`	string[]	No	Grader plugin specifiers (npm packages or local paths). Not yet wired.
`executor_plugins`	string[]	No	Executor plugin specifiers (npm packages or local paths). Not yet wired.
`eval_plugin`	string	No	Single eval-provider plugin specifier. Not yet wired.

¹ Exactly one of variants or matrix is required — they are mutually exclusive. ² Required with variants; must be omitted with matrix (derived from the axis paths).

Variant overrides

Each variant entry may override two fields on the resolved eval spec:

variants:
  my-variant:
    overrides: # eval-level defaults (model, runs, timeout, etc.)
      model: gpt-5.5
    environment: # environment fields (skills, mcpServers, git, files, commands, commandTimeout)
      skills: ["plugins/${eval.parent}"]

Other eval fields (scoring, graders, individual stimuli) are not overridable in v1.

Merge semantics inside `environment`

Scalars: replace.
Arrays (e.g. skills, files, commands): replace entirely. They do not concatenate.
Maps (e.g. mcpServers): deep-merge by key.
null on a map entry (e.g. mcpServers.foo: null): delete that entry from the inherited config.
environment: null: remove the entire inherited environment.

Matrix (cross-product) variants

Sweeping several axes at once (e.g. models × skill sets) by hand-writing every variants entry is verbose and error-prone. A matrix block declares the axes and Vally expands their full cross-product into the variants map for you — everything downstream (drift checking, hashing, per-variant output, --variant, server ingest) sees ordinary variants and never has to know matrix existed.

name: model-x-skills
evals:
  - evals/auth/eval.yaml

matrix:
  model:
    path: /defaults/model
    values: [gpt-5.5, claude-sonnet-4.6]
  skills:
    path: /environment/skills
    values:
      - none: []
      - matched: ["plugins/${eval.parent}/skills/*"]

baseline:
  model: gpt-5.5
  skills: none

This expands to four variants — model=gpt-5.5,skills=none, model=gpt-5.5,skills=matched, model=claude-sonnet-4.6,skills=none, and model=claude-sonnet-4.6,skills=matched — with vary derived as [/defaults/model, /environment/skills].

Axes

Each key under matrix is an axis. An axis has:

path — a JSON Pointer into the resolved config, using the same drift-space as vary. Each axis path becomes a vary entry, so do not also list vary yourself; it is derived (and rejected if present). An axis replaces one whole field, so the supported paths are the wholesale-replaceable subset of the variant-override fields: /defaults/<field> (runs, timeout, model, executor, judge_model) and /environment/<field> for skills, git, files, and commands. mcpServers is the exception — it deep-merges by key rather than replacing wholesale, so target a single entry with /environment/mcpServers/<name> instead of the whole /environment/mcpServers map (which an axis cannot replace).
values — a non-empty list of the values that axis takes. Each value is either:
- a scalar (gpt-5.5, 3, true), whose label is its stringified form, or
- a single-key { label: value } map ({ matched: [...] }), letting you give a readable label to a non-scalar value such as a list or object.

Generated variant names

Each cell’s name joins its axis labels in declaration order: axis1=label1,axis2=label2,…. These names are stable and are exactly what you pass to --variant and what appears as the variant value in JSONL output and the analytics server. Axis names and value labels must match ^[A-Za-z0-9._-]+$ (use the { label: value } form to give an otherwise-illegal value a clean label); axis names may not be purely numeric.

Baseline

In matrix mode baseline is a map of axis name → value label that selects exactly one generated cell (every axis must be listed, with a label that exists on that axis). In the example above it resolves to model=gpt-5.5,skills=none.

Constraints

The product must yield at least 2 and at most 256 cells.
matrix and variants are mutually exclusive — use one or the other.
${eval.…} interpolation works inside axis values, just as in hand-written variant overrides.

Excluding individual cells from a grid (ragged matrices) is not supported in v1; enumerate variants explicitly if you need to skip combinations.

Interpolation variables

Values inside variant overrides support ${eval.…} interpolation, derived from the eval file being processed:

Variable	Example value
`${eval.name}`	The eval spec’s `name` field
`${eval.path}`	`tests/dotnet-maui/maui-theming/eval.vally.yaml`
`${eval.dir}`	`tests/dotnet-maui/maui-theming`
`${eval.basename}`	`eval.vally.yaml`
`${eval.parent}`	`maui-theming` (last directory segment)
`${eval.grandparent}`	`dotnet-maui` (second-to-last directory segment)

Useful for templating skill paths against the eval’s location:

variants:
  matched-skill:
    environment:
      skills: ["plugins/${eval.grandparent}/skills/${eval.parent}"]

Output

vally experiment run writes one subdirectory per variant under the run’s timestamped output directory. See the vally experiment CLI reference for the layout and the JSONL record shape.