Experiment File Schema

An experiment file declares a controlled comparison of multiple eval runs that differ along a declared axis. The CLI reads it with vally experiment run and produces one output directory per variant.

Everything outside the vary paths is required to be identical across variants; differences elsewhere in the resolved config are a hard error at plan time.

name: model-comparison
evals:
  - evals/auth/eval.yaml
  - evals/search/eval.yaml
vary: [/defaults/model]
baseline: gpt-5.5
variants:
  gpt-5.5:
    overrides:
      model: gpt-5.5
  claude-sonnet-4.6:
    overrides:
      model: claude-sonnet-4.6
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Human-readable experiment name. Surfaces in reports, JSONL records, and the analytics server. |
| evals | string[] | Yes | Eval spec files to include. Supports globs. Paths resolve relative to the experiment file’s directory. Individual patterns that match zero files are skipped silently; an error is raised only when the combined patterns match zero files. |
| vary | string[] | Yes² | JSON Pointer paths that may differ between variants. Subtree prefix semantics: /environment/skills authorizes any change at or below that path. Differences outside vary are rejected as drift. Required with variants; omit with matrix, where it is derived from the axis paths (and rejected if present). |
| baseline | string or map | Yes | The control variant. With variants, a string naming a key in variants. With matrix, a map of axis name → value label selecting one generated cell. At least two variants are required. Currently used to validate the experiment file and flag the baseline plan in dry-run output; comparison reporting that consumes this field is not yet wired. |
| variants | map<string, VariantOverride> | Yes¹ | Named variants. Each entry’s overrides merge on top of the eval spec’s defaults and environment before per-stimulus environment resolution, so variant changes can influence what stimuli see. |
| matrix | map<string, Axis> | Yes¹ | Declarative cross-product of named axes that expands into variants. Mutually exclusive with variants. See Matrix (cross-product) variants. |
| overrides | EvalDefaults | No | Experiment-wide execution overrides applied to every variant. Precedence: CLI flags > experiment overrides > eval defaults. |
| execution | object | No | Runner mechanics (excluded from the config hash). Currently only workers: <integer> is honored; --workers on the CLI takes precedence. |
| filter | map | No | Stimulus tag filter applied consistently to every variant. Parsed but not yet wired into vally experiment run, so currently inert. |
| grader_plugins | string[] | No | Grader plugin specifiers (npm packages or local paths). Not yet wired. |
| executor_plugins | string[] | No | Executor plugin specifiers (npm packages or local paths). Not yet wired. |
| eval_plugin | string | No | Single eval-provider plugin specifier. Not yet wired. |

¹ Exactly one of variants or matrix is required; they are mutually exclusive.
² Required with variants; must be omitted with matrix (derived from the axis paths).
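The subtree-prefix rule for vary can be pictured as a diff between two resolved configs followed by a prefix test on each differing path. The sketch below is illustrative only: diff_paths, is_authorized, and find_drift are hypothetical helpers, not Vally's implementation. It treats lists as opaque values and skips JSON Pointer escaping for brevity.

```python
from typing import Any

def diff_paths(a: Any, b: Any, prefix: str = "") -> list[str]:
    """Return JSON Pointer-style paths where two resolved configs differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        paths: list[str] = []
        for key in sorted(set(a) | set(b)):
            child = f"{prefix}/{key}"
            if key not in a or key not in b:
                paths.append(child)          # key present on one side only
            else:
                paths.extend(diff_paths(a[key], b[key], child))
        return paths
    # Scalars and lists are compared as whole values (an assumption).
    return [] if a == b else [prefix]

def is_authorized(path: str, vary: list[str]) -> bool:
    """A path is allowed if it equals a vary entry or lies below one."""
    return any(path == v or path.startswith(v + "/") for v in vary)

def find_drift(baseline: dict, variant: dict, vary: list[str]) -> list[str]:
    """Differences that fall outside every vary subtree."""
    return [p for p in diff_paths(baseline, variant) if not is_authorized(p, vary)]

base = {"defaults": {"model": "gpt-5.5", "runs": 3},
        "environment": {"skills": []}}
other = {"defaults": {"model": "claude-sonnet-4.6", "runs": 5},
         "environment": {"skills": ["plugins/auth"]}}

# runs differs but is not listed in vary, so it is reported as drift.
print(find_drift(base, other, ["/defaults/model", "/environment/skills"]))
# prints ['/defaults/runs']
```

Note that the prefix test compares whole path segments, so /environment/skills does not authorize a sibling such as /environment/skillsets.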

Each variant entry may override two fields on the resolved eval spec:

variants:
  my-variant:
    overrides:       # eval-level defaults (model, runs, timeout, etc.)
      model: gpt-5.5
    environment:     # environment fields (skills, mcpServers, git, files, commands)
      skills: ["plugins/${eval.parent}"]

Other eval fields (scoring, graders, individual stimuli) are not overridable in v1.

Override values merge onto the inherited config according to their type:

  • Scalars: replace.
  • Arrays (e.g. skills, files, commands): replace entirely. They do not concatenate.
  • Maps (e.g. mcpServers): deep-merge by key.
  • null on a map entry (e.g. mcpServers.foo: null): delete that entry from the inherited config.
  • environment: null: remove the entire inherited environment.
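The merge rules above can be sketched as a small recursive function. This is a minimal illustration under stated assumptions, not Vally's code: merge_override and apply_variant are hypothetical names, the mcpServers entry contents are placeholders, and treating null as "delete" for every map key (not only mcpServers entries) is an assumption.

```python
from typing import Any

def merge_override(inherited: Any, override: Any) -> Any:
    """Apply one override value on top of an inherited value."""
    if isinstance(inherited, dict) and isinstance(override, dict):
        merged = dict(inherited)                 # maps deep-merge by key
        for key, value in override.items():
            if value is None:
                merged.pop(key, None)            # null deletes the inherited entry
            else:
                merged[key] = merge_override(merged.get(key), value)
        return merged
    return override                              # scalars and arrays replace wholesale

def apply_variant(spec: dict, variant: dict) -> dict:
    """Merge a variant's overrides/environment onto a spec's defaults/environment."""
    resolved = dict(spec)
    if "overrides" in variant:
        resolved["defaults"] = merge_override(spec.get("defaults", {}), variant["overrides"])
    if "environment" in variant:
        if variant["environment"] is None:
            resolved.pop("environment", None)    # environment: null removes it entirely
        else:
            resolved["environment"] = merge_override(
                spec.get("environment", {}), variant["environment"])
    return resolved

spec = {
    "defaults": {"model": "gpt-5.5", "runs": 3},
    "environment": {
        "skills": ["a", "b"],
        "mcpServers": {"foo": {"placeholder": 1}, "bar": {"placeholder": 2}},
    },
}
variant = {
    "overrides": {"model": "claude-sonnet-4.6"},
    "environment": {
        "skills": ["c"],                                        # replaces ["a", "b"]
        "mcpServers": {"foo": None, "baz": {"placeholder": 3}},  # drop foo, add baz
    },
}
print(apply_variant(spec, variant))
```

In the result, runs is inherited unchanged, skills is exactly ["c"], and mcpServers holds bar and baz only.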

Sweeping several axes at once (e.g. models × skill sets) by hand-writing every variants entry is verbose and error-prone. A matrix block declares the axes and Vally expands their full cross-product into the variants map for you — everything downstream (drift checking, hashing, per-variant output, --variant, server ingest) sees ordinary variants and never has to know matrix existed.

name: model-x-skills
evals:
  - evals/auth/eval.yaml
matrix:
  model:
    path: /defaults/model
    values: [gpt-5.5, claude-sonnet-4.6]
  skills:
    path: /environment/skills
    values:
      - none: []
      - matched: ["plugins/${eval.parent}/skills/*"]
baseline:
  model: gpt-5.5
  skills: none

This expands to four variants — model=gpt-5.5,skills=none, model=gpt-5.5,skills=matched, model=claude-sonnet-4.6,skills=none, and model=claude-sonnet-4.6,skills=matched — with vary derived as [/defaults/model, /environment/skills].
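To make the expansion concrete, here is a rough sketch of a cross-product over the axes in the example. expand_matrix and split_entry are hypothetical helpers written for illustration; the real expansion may differ in details such as cell ordering or how values are stored.

```python
from itertools import product

def split_entry(entry):
    """Return (label, value) for one axis value."""
    if isinstance(entry, dict) and len(entry) == 1:
        return next(iter(entry.items()))     # { label: value } form
    return str(entry), entry                 # scalar: label is its string form

def expand_matrix(matrix: dict):
    """Expand axes into named cells and derive the vary list."""
    axes = list(matrix.items())              # dicts keep declaration order
    vary = [axis["path"] for _, axis in axes]
    cells = {}
    for combo in product(*(axis["values"] for _, axis in axes)):
        parts, assignments = [], {}
        for (axis_name, axis), entry in zip(axes, combo):
            label, value = split_entry(entry)
            parts.append(f"{axis_name}={label}")
            assignments[axis["path"]] = value
        cells[",".join(parts)] = assignments
    return cells, vary

matrix = {
    "model": {"path": "/defaults/model",
              "values": ["gpt-5.5", "claude-sonnet-4.6"]},
    "skills": {"path": "/environment/skills",
               "values": [{"none": []},
                          {"matched": ["plugins/${eval.parent}/skills/*"]}]},
}
cells, vary = expand_matrix(matrix)
for name in cells:
    print(name)
print(vary)
```

Each cell maps an axis path to the value that cell assigns there, which is the information a hand-written variants entry would otherwise carry.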

Each key under matrix is an axis. An axis has:

  • path — a JSON Pointer into the resolved config, using the same drift-space as vary. Each axis path becomes a vary entry, so do not also list vary yourself; it is derived (and rejected if present). An axis replaces one whole field, so the supported paths are the wholesale-replaceable subset of the variant-override fields: /defaults/<field> (runs, timeout, model, executor, judge_model) and /environment/<field> for skills, git, files, and commands. mcpServers is the exception — it deep-merges by key rather than replacing wholesale, so target a single entry with /environment/mcpServers/<name> instead of the whole /environment/mcpServers map (which an axis cannot replace).
  • values — a non-empty list of the values that axis takes. Each value is either:
    • a scalar (gpt-5.5, 3, true), whose label is its stringified form, or
    • a single-key { label: value } map ({ matched: [...] }), letting you give a readable label to a non-scalar value such as a list or object.

Each cell’s name joins its axis labels in declaration order: axis1=label1,axis2=label2,…. These names are stable and are exactly what you pass to --variant and what appears as the variant value in JSONL output and the analytics server. Axis names and value labels must match ^[A-Za-z0-9._-]+$ (use the { label: value } form to give an otherwise-illegal value a clean label); axis names may not be purely numeric.
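The naming rules can be checked with the regular expression quoted above. The following sketch uses hypothetical helper names and makes one assumption: "purely numeric" is read as "consists only of digits".

```python
import re

# Pattern quoted in the text; fullmatch is used so a trailing newline is rejected too.
LABEL_RE = re.compile(r"[A-Za-z0-9._-]+")

def is_valid_label(label: str) -> bool:
    """Value labels must match the allowed character set."""
    return LABEL_RE.fullmatch(label) is not None

def is_valid_axis_name(name: str) -> bool:
    """Axis names follow the same pattern and may not be digits only (assumed reading)."""
    return is_valid_label(name) and not name.isdigit()

def cell_name(labels: dict[str, str]) -> str:
    """Join axis=label pairs in declaration order."""
    return ",".join(f"{axis}={label}" for axis, label in labels.items())

print(is_valid_label("gpt-5.5"))                  # True
print(is_valid_label("plugins/auth/skills/*"))    # False: needs a { label: value } entry
print(is_valid_axis_name("123"))                  # False
print(cell_name({"model": "gpt-5.5", "skills": "matched"}))
```

The second check shows why a path-like value needs the { label: value } form: the slash and asterisk are outside the allowed character set.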

In matrix mode baseline is a map of axis name → value label that selects exactly one generated cell (every axis must be listed, with a label that exists on that axis). In the example above it resolves to model=gpt-5.5,skills=none.
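Resolving a matrix-mode baseline amounts to checking that every axis is named with a known label and then building the cell name in axis declaration order. The sketch below is illustrative; resolve_baseline is a hypothetical function, and the error messages are invented for the example.

```python
def resolve_baseline(axis_labels: dict[str, list[str]], baseline: dict[str, str]) -> str:
    """Map a baseline of axis -> label to the name of exactly one cell."""
    missing = [axis for axis in axis_labels if axis not in baseline]
    unknown = [axis for axis in baseline if axis not in axis_labels]
    if missing or unknown:
        raise ValueError(f"baseline must list every axis: missing={missing}, unknown={unknown}")
    for axis, label in baseline.items():
        if label not in axis_labels[axis]:
            raise ValueError(f"axis {axis!r} has no label {label!r}")
    # Join in axis declaration order, not in the order the baseline was written.
    return ",".join(f"{axis}={baseline[axis]}" for axis in axis_labels)

axis_labels = {
    "model": ["gpt-5.5", "claude-sonnet-4.6"],
    "skills": ["none", "matched"],
}
print(resolve_baseline(axis_labels, {"skills": "none", "model": "gpt-5.5"}))
# prints model=gpt-5.5,skills=none
```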

  • The product must yield at least 2 and at most 256 cells.
  • matrix and variants are mutually exclusive — use one or the other.
  • ${eval.…} interpolation works inside axis values, just as in hand-written variant overrides.

Excluding individual cells from a grid (ragged matrices) is not supported in v1; enumerate variants explicitly if you need to skip combinations.

Values inside variant overrides support ${eval.…} interpolation. The variables are derived from the eval file being processed:

| Variable | Example value |
| --- | --- |
| ${eval.name} | The eval spec’s name field |
| ${eval.path} | tests/dotnet-maui/maui-theming/eval.vally.yaml |
| ${eval.dir} | tests/dotnet-maui/maui-theming |
| ${eval.basename} | eval.vally.yaml |
| ${eval.parent} | maui-theming (last directory segment) |
| ${eval.grandparent} | dotnet-maui (second-to-last directory segment) |

Useful for templating skill paths against the eval’s location:

variants:
  matched-skill:
    environment:
      skills: ["plugins/${eval.grandparent}/skills/${eval.parent}"]
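As a rough picture of how these variables relate to the eval file's path, the sketch below derives them with pathlib and substitutes them into a template. It is not Vally's implementation: eval_variables and interpolate are hypothetical helpers, the eval name is passed in by hand, and raising on an unknown variable is an assumption.

```python
import re
from pathlib import PurePosixPath

def eval_variables(eval_path: str, eval_name: str) -> dict[str, str]:
    """Derive the eval.* variables from an eval file path and its name field."""
    path = PurePosixPath(eval_path)
    return {
        "eval.name": eval_name,
        "eval.path": str(path),
        "eval.dir": str(path.parent),
        "eval.basename": path.name,
        "eval.parent": path.parent.name,                # last directory segment
        "eval.grandparent": path.parent.parent.name,    # second-to-last segment
    }

_VAR = re.compile(r"\$\{(eval\.[a-z]+)\}")

def interpolate(text: str, variables: dict[str, str]) -> str:
    """Replace each ${eval.x} occurrence with its value."""
    def replace(match: re.Match) -> str:
        key = match.group(1)
        if key not in variables:
            raise KeyError(f"unknown variable ${{{key}}}")   # assumed behavior
        return variables[key]
    return _VAR.sub(replace, text)

variables = eval_variables("tests/dotnet-maui/maui-theming/eval.vally.yaml",
                           eval_name="maui-theming-eval")   # placeholder name
print(interpolate("plugins/${eval.grandparent}/skills/${eval.parent}", variables))
# prints plugins/dotnet-maui/skills/maui-theming
```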

vally experiment run writes one subdirectory per variant under the run’s timestamped output directory. See the vally experiment CLI reference for the layout and the JSONL record shape.