The eval.yaml file defines what to test and how to grade it. This page documents every field.
Field Type Required Description namestring No Human-readable name for this eval descriptionstring No What this eval tests versionstring No Version for tracking type"capability" | "regression"No Eval intent environmentEnvironmentConfig No Root environment (merged into all stimuli) configEvalConfig No Execution configuration stimuliStimulus[] Yes Test cases (at least one required) scoringScoringConfig No Score aggregation settings
runs : 3 # Trials per stimulus (default: 1)
timeout : 2m # Per trial (e.g. 5m, 300s, 30000ms); unit suffix required (default: 2m)
model : gpt-5.5 # Model for agent execution
judge_model : gpt-5.5 # Default model for LLM graders
executor : copilot-sdk # or custom executor via --executor-plugin
Field Type Default Description runsnumber 1Number of trials per stimulus. 0 = lint-only (skip execution). 1 = single run. ≥ 2 = multi-trial with pass@k/pass^k aggregation. timeoutduration 2mTime before a trial times out. Requires a unit suffix (e.g. 5m, 300s, 30000ms). modelstring — Model identifier for the executor judge_modelstring — Default model for LLM judge graders (prompt and pairwise). Graders can override this in their own config. executorstring — Name of the executor to use for this eval. A spec selects a single executor. Built-in: copilot-sdk. Additional executors can be registered by passing --executor-plugin on the CLI, then referenced here by name. The flag is repeatable so a single vally eval invocation can run multiple specs that select different executors.
rubric : [ " criterion 1 " , " criterion 2 " ]
Field Type Required Description namestring Yes Unique identifier for this stimulus promptstring Yes The prompt sent to the agent environmentEnvironmentConfig No Stimulus-specific environment (merged with root) gradersGraderConfig[] No Graders to run on the trajectory rubricstring[] No Evaluation criteria for LLM judge graders. Passed to prompt and pairwise graders as context for their judgment. constraintsConstraints No Resource limits for the trial tagsRecord<string, string | string[]>No Key-value pairs for filtering into suites
Tags are key-value pairs used for filtering stimuli into suites. They can be defined at the eval level (inherited by all stimuli) or the stimulus level (overrides eval-level tags on the same key).
Filtering semantics:
AND across keys — a stimulus must match all specified tag keys
OR within values — a stimulus matches a key if it has any of the specified values
Stimulus tags override eval tags on the same key
See Authoring Eval Suites for tagging strategies and suite configuration.
Environment Fields
- src : fixtures/input.txt
- src : fixtures/test-data # directories are copied recursively
url : http://localhost:3000/mcp
Field Type Description commandsstring[]Shell commands to run during setup files{src, dest}[]Files or directories to copy into the workspace before execution gitobjectGit worktree configuration for fixture data (see below ) mcpServersRecord<string, McpServerConfig>Named MCP servers to start or connect to skillsstring[]Paths to SKILL.md files to load
Git config
The git field sets up a Git worktree as the evaluation workspace, checking out a specific commit, tag, or branch from a local repository.
Field Type Required Description type"worktree"Yes Must be "worktree" refstring Yes A commit-ish value (tag, commit SHA, or branch name) to check out sourcestring Yes Path to the local repo used as the worktree source
MCP server config
Each entry in mcpServers is either a stdio server (launched as a child process) or a remote server (connected over HTTP/SSE).
Stdio (child process)
Field Type Required Description type"stdio"Yes Launch as a child process commandstring Yes Executable to run argsstring[] No Arguments passed to the command envRecord<string, string>No Extra environment variables for the child process cwdstring No Working directory for the child process timeoutnumber No Timeout in milliseconds for connecting to / invoking the server
Remote (HTTP/SSE)
url : http://localhost:3000/mcp
Authorization : " Bearer ${API_TOKEN} "
Field Type Required Description type"http" | "sse"Yes Connect to a remote server urlstring Yes Server endpoint URL headersRecord<string, string>No Extra HTTP headers (e.g. auth tokens) timeoutnumber No Timeout in milliseconds for connecting to / invoking the server
Note
When both root and stimulus environments are defined, they are merged with conflict detection:
Skills : Duplicates across root and stimulus are deduplicated with a warning.
Commands : Duplicate commands across root and stimulus throw an error.
Files : Files with the same dest but different src throw a conflict error. Identical entries are also rejected as redundant.
Only cross-array conflicts are detected — duplicates within the same array (root-only or stimulus-only) are not flagged.
name : " output includes hello "
Field Type Required Description typestring Yes Registered grader name (e.g., output-contains, file-exists) namestring No Human-readable display name for this grader instance. Defaults to type when omitted. configobject No Grader-specific configuration (varies by type)
expect_tools : [ " write_file " , " read_file " ]
reject_tools : [ " delete_file " ]
expect_skills : [ " my-skill " ]
Field Type Description max_turnsnumber Maximum conversation turns max_tokensnumber Maximum total tokens max_durationduration Maximum wall-clock time (e.g. 2m, 30s, 500ms). Unit suffix required. expect_toolsstring[] Tools the agent must call reject_toolsstring[] Tools the agent must not call expect_skillsstring[] Skills that must be activated reject_skillsstring[] Skills that must not be activated
Caution
scoring.threshold is applied to the aggregate score when present, and can be
overridden for a run with vally eval --threshold <number>. scoring.weights
is accepted and validated, but is not yet applied by the evaluation pipeline.
Currently all grader scores are averaged equally. See the
Scoring concept page for details.
Field Type Description weightsRecord<string, number>Weight per grader type. If omitted, all graders weighted equally. thresholdnumber Minimum aggregate score to pass (0.0–1.0)
description : Evaluates the test-writer skill
- name : basic-test-generation
Write unit tests for this function:
function add(a, b) { return a + b; }
Save the tests to add.test.js.
name : " test file was created "
expect_tools : [ " write_file " ]
- name : edge-case-empty-input
prompt : " Write tests for a function that takes no arguments. "