Experiment File Schema
An experiment file declares a controlled comparison of multiple eval runs that differ along a declared axis. The CLI reads it with `vally experiment run` and produces one output directory per variant.
Everything outside the `vary` paths is required to be identical across variants; differences elsewhere in the resolved config are a hard error at plan time.
Minimal example

```yaml
name: model-comparison
evals:
  - evals/auth/eval.yaml
  - evals/search/eval.yaml

vary: [/defaults/model]

baseline: gpt-5.5
variants:
  gpt-5.5:
    overrides:
      model: gpt-5.5
  claude-sonnet-4.6:
    overrides:
      model: claude-sonnet-4.6
```

Top-level fields

| Field | Type | Required | Description |
|---|---|---|---|
| `name` | `string` | Yes | Human-readable experiment name. Surfaces in reports, JSONL records, and the analytics server. |
| `evals` | `string[]` | Yes | Eval spec files to include. Supports globs. Paths resolve relative to the experiment file’s directory. Individual patterns that match zero files are skipped silently; an error is raised only when the combined patterns match zero files. |
| `vary` | `string[]` | Yes² | JSON Pointer paths that may differ between variants. Subtree prefix semantics: `/environment/skills` authorizes any change at or below that path. Differences outside `vary` are rejected as drift. Required with `variants`; omit with `matrix`, where it is derived from the axis paths (and rejected if present). |
| `baseline` | `string \| map` | Yes | The control variant. With `variants`, a string naming a key in `variants`. With `matrix`, a map of axis name → value label selecting one generated cell. At least two variants are required. Currently used to validate the experiment file and flag the baseline plan in dry-run output; comparison reporting that consumes this field is not yet wired. |
| `variants` | `map<string, VariantOverride>` | Yes¹ | Named variants. Each entry’s overrides merge on top of the eval spec’s defaults and environment before per-stimulus environment resolution, so variant changes can influence what stimuli see. |
| `matrix` | `map<string, Axis>` | Yes¹ | Declarative cross-product of named axes that expands into variants. Mutually exclusive with `variants`. See Matrix (cross-product) variants. |
| `overrides` | `EvalDefaults` | No | Experiment-wide execution overrides applied to every variant. Precedence: CLI flags > experiment overrides > eval defaults. |
| `execution` | `object` | No | Runner mechanics (excluded from the config hash). Currently only `workers: <integer>` is honored; `--workers` on the CLI takes precedence. |
| `filter` | `map` | No | Stimulus tag filter applied consistently to every variant. Not yet wired into `vally experiment run`; parsed but currently inert. |
| `grader_plugins` | `string[]` | No | Grader plugin specifiers (npm packages or local paths). Not yet wired. |
| `executor_plugins` | `string[]` | No | Executor plugin specifiers (npm packages or local paths). Not yet wired. |
| `eval_plugin` | `string` | No | Single eval-provider plugin specifier. Not yet wired. |

¹ Exactly one of `variants` or `matrix` is required; they are mutually exclusive.
² Required with `variants`; must be omitted with `matrix` (derived from the axis paths).
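
To illustrate how the optional top-level fields sit alongside `vary`, here is a sketch of a slightly fuller experiment file. The experiment name, variant names, glob pattern, and the `runs` and `workers` values are illustrative assumptions; only the field names and the precedence rules come from the table above.

```yaml
# Hypothetical experiment file; the values are illustrative, not defaults.
name: model-comparison-full
evals:
  - evals/*/eval.yaml        # glob, resolved relative to this file's directory

vary: [/defaults/model]      # only the model may differ between variants

baseline: control
variants:
  control:
    overrides:
      model: gpt-5.5
  candidate:
    overrides:
      model: claude-sonnet-4.6
      # Adding e.g. `runs: 5` here would differ at /defaults/runs, which is
      # outside `vary`, so it should be rejected as drift at plan time.

overrides:                   # applied to every variant; CLI flags still win
  runs: 3

execution:
  workers: 4                 # runner mechanics; --workers on the CLI takes precedence
```

Settings that should apply to every variant belong in the top-level `overrides` block rather than in each variant, which keeps them out of the drift check.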
Variant overrides
Each variant entry may override two fields on the resolved eval spec:

```yaml
variants:
  my-variant:
    overrides:      # eval-level defaults (model, runs, timeout, etc.)
      model: gpt-5.5
    environment:    # environment fields (skills, mcpServers, git, files, commands)
      skills: ["plugins/${eval.parent}"]
```

Other eval fields (scoring, graders, individual stimuli) are not overridable in v1.
Merge semantics inside environment
- Scalars: replace.
- Arrays (e.g. `skills`, `files`, `commands`): replace entirely. They do not concatenate.
- Maps (e.g. `mcpServers`): deep-merge by key.
- `null` on a map entry (e.g. `mcpServers.foo: null`): delete that entry from the inherited config.
- `environment: null`: remove the entire inherited environment.
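
The merge rules above can be seen side by side in a small fragment. This is a sketch only: the variant names, the `plugins/extra` path, and the `foo` server name are hypothetical.

```yaml
variants:
  replace-skills:
    environment:
      skills: ["plugins/extra"]   # array: replaces the inherited list, does not append
  drop-foo:
    environment:
      mcpServers:
        foo: null                 # map entry set to null: removes the inherited `foo`;
                                  # other inherited servers remain (deep-merge by key)
  bare:
    environment: null             # removes the entire inherited environment
```

Because arrays replace rather than concatenate, a variant that wants to add one skill to the inherited set has to restate the full list.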
Matrix (cross-product) variants
Sweeping several axes at once (e.g. models × skill sets) by hand-writing every `variants` entry is verbose and error-prone. A `matrix` block declares the axes, and Vally expands their full cross-product into the `variants` map for you. Everything downstream (drift checking, hashing, per-variant output, `--variant`, server ingest) sees ordinary variants and never has to know `matrix` existed.

```yaml
name: model-x-skills
evals:
  - evals/auth/eval.yaml

matrix:
  model:
    path: /defaults/model
    values: [gpt-5.5, claude-sonnet-4.6]
  skills:
    path: /environment/skills
    values:
      - none: []
      - matched: ["plugins/${eval.parent}/skills/*"]

baseline:
  model: gpt-5.5
  skills: none
```

This expands to four variants: `model=gpt-5.5,skills=none`, `model=gpt-5.5,skills=matched`, `model=claude-sonnet-4.6,skills=none`, and `model=claude-sonnet-4.6,skills=matched`, with `vary` derived as `[/defaults/model, /environment/skills]`.
Each key under `matrix` is an axis. An axis has:
- `path`: a JSON Pointer into the resolved config, using the same drift-space as `vary`. Each axis path becomes a `vary` entry, so do not also list `vary` yourself; it is derived (and rejected if present). An axis replaces one whole field, so the supported paths are the wholesale-replaceable subset of the variant-override fields: `/defaults/<field>` (`runs`, `timeout`, `model`, `executor`, `judge_model`) and `/environment/<field>` for `skills`, `git`, `files`, and `commands`. `mcpServers` is the exception: it deep-merges by key rather than replacing wholesale, so target a single entry with `/environment/mcpServers/<name>` instead of the whole `/environment/mcpServers` map (which an axis cannot replace).
- `values`: a non-empty list of the values that axis takes. Each value is either:
  - a scalar (`gpt-5.5`, `3`, `true`), whose label is its stringified form, or
  - a single-key `{ label: value }` map (`{ matched: [...] }`), letting you give a readable label to a non-scalar value such as a list or object.
Generated variant names
Each cell’s name joins its axis labels in declaration order: `axis1=label1,axis2=label2,…`. These names are stable and are exactly what you pass to `--variant` and what appears as the variant value in JSONL output and the analytics server. Axis names and value labels must match `^[A-Za-z0-9._-]+$` (use the `{ label: value }` form to give an otherwise-illegal value a clean label); axis names may not be purely numeric.
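
As an illustration of the naming rules, the fragment below mixes a plain scalar value with labeled ones. The model identifier `provider/custom-model` is hypothetical, chosen because its `/` would not match the label pattern; the labels `local`, `once`, and `thrice` are likewise made up for this sketch.

```yaml
matrix:
  model:
    path: /defaults/model
    values:
      - gpt-5.5                        # scalar: label is "gpt-5.5"
      - local: provider/custom-model   # labeled: "/" is not allowed in a label
  runs:
    path: /defaults/runs
    values:
      - once: 1
      - thrice: 3

baseline:
  model: gpt-5.5
  runs: once

# Variant names this should generate (axis labels joined in declaration order):
#   model=gpt-5.5,runs=once
#   model=gpt-5.5,runs=thrice
#   model=local,runs=once
#   model=local,runs=thrice
```

Explicit labels also keep the generated names readable when the underlying value is long or likely to change.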
Baseline
In matrix mode `baseline` is a map of axis name → value label that selects exactly one generated cell (every axis must be listed, with a label that exists on that axis). In the example above it resolves to `model=gpt-5.5,skills=none`.
Constraints
- The product must yield at least 2 and at most 256 cells.
- `matrix` and `variants` are mutually exclusive; use one or the other.
- `${eval.…}` interpolation works inside axis values, just as in hand-written variant overrides.

Excluding individual cells from a grid (ragged matrices) is not supported in v1; enumerate variants explicitly if you need to skip combinations.
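
If one combination from the earlier model × skills grid needs to be skipped, one option is to write the remaining cells out as explicit variants. The sketch below drops the cell that pairs `claude-sonnet-4.6` with no skills. The experiment name and variant names are arbitrary choices, and `vary` is listed by hand because it is no longer derived from axis paths.

```yaml
name: model-x-skills-ragged
evals:
  - evals/auth/eval.yaml

vary: [/defaults/model, /environment/skills]

baseline: gpt-none
variants:
  gpt-none:
    overrides:
      model: gpt-5.5
    environment:
      skills: []
  gpt-matched:
    overrides:
      model: gpt-5.5
    environment:
      skills: ["plugins/${eval.parent}/skills/*"]
  claude-matched:
    overrides:
      model: claude-sonnet-4.6
    environment:
      skills: ["plugins/${eval.parent}/skills/*"]
```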
Interpolation variables
Values inside variant overrides support `${eval.…}` interpolation, derived from the eval file being processed:

| Variable | Example value |
|---|---|
| `${eval.name}` | The eval spec’s `name` field |
| `${eval.path}` | `tests/dotnet-maui/maui-theming/eval.vally.yaml` |
| `${eval.dir}` | `tests/dotnet-maui/maui-theming` |
| `${eval.basename}` | `eval.vally.yaml` |
| `${eval.parent}` | `maui-theming` (last directory segment) |
| `${eval.grandparent}` | `dotnet-maui` (second-to-last directory segment) |

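As a worked illustration, consider the two eval files from the minimal example, `evals/auth/eval.yaml` and `evals/search/eval.yaml`. Assuming the variables resolve as the table describes, a single variant override expands differently for each eval. The variant name and the `plugins/` layout are hypothetical.

```yaml
variants:
  with-skill:
    environment:
      skills: ["plugins/${eval.parent}"]
      # evals/auth/eval.yaml   -> ${eval.parent} = auth   -> plugins/auth
      # evals/search/eval.yaml -> ${eval.parent} = search -> plugins/search
      # In both cases ${eval.grandparent} = evals and ${eval.basename} = eval.yaml
```
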
Useful for templating skill paths against the eval’s location:

```yaml
variants:
  matched-skill:
    environment:
      skills: ["plugins/${eval.grandparent}/skills/${eval.parent}"]
```

Output
`vally experiment run` writes one subdirectory per variant under the run’s timestamped output directory. See the vally experiment CLI reference for the layout and the JSONL record shape.