Eval Spec Reference

The eval.yaml file defines what to test and how to grade it. This page documents every field.

Top-level fields

Field	Type	Required	Description
`name`	string	No	Human-readable name for this eval
`description`	string	No	What this eval tests
`version`	string	No	Version for tracking
`type`	`"capability"` \| `"regression"`	No	Eval intent
`environment`	EnvironmentConfig	No	Root environment (merged into all stimuli)
`defaults`	EvalDefaults	No	Execution configuration
`stimuli`	Stimulus[]	Yes	Test cases (at least one required)
`scoring`	ScoringConfig	No	Score aggregation settings

Defaults

defaults:
  runs: 3 # Trials per stimulus (default: 1)
  timeout: 2m # Per trial (e.g. 5m, 300s, 30000ms); unit suffix required (default: 2m)
  model: gpt-5.5 # Model for agent execution
  judge_model: gpt-5.5 # Default model for LLM graders
  judge_reasoning_effort: high # Reasoning effort for the judge model
  executor: copilot-sdk # or custom executor via --executor-plugin

Field	Type	Default	Description
`runs`	number	`1`	Number of trials per stimulus. `0` = lint-only (skip execution). `1` = single run. `≥ 2` = multi-trial with pass@k/pass^k aggregation.
`timeout`	duration	`2m`	Time before a trial times out. Requires a unit suffix (e.g. `5m`, `300s`, `30000ms`).
`model`	string	—	Model identifier for the executor
`judge_model`	string	—	Default model for LLM judge graders (`prompt` and `panel`). Graders can override this in their own config.
`judge_reasoning_effort`	string	—	Reasoning effort for the judge model: `low`, `medium`, `high`, or `xhigh`. Applies only to the eval-level `judge_model` — not to per-grader `model` overrides or the panel’s `models`. When unset, the judge runs at the model’s own default effort. Only takes effect on models that support reasoning effort; unsupported models may ignore or reject the setting.
`executor`	string	—	Name of the executor to use for this eval. A spec selects a single executor. Built-in: `copilot-sdk`. Additional executors can be registered by passing `--executor-plugin` on the CLI, then referenced here by name. The flag is repeatable so a single `vally eval` invocation can run multiple specs that select different executors.

Stimulus

stimuli:
  - name: my-test-case
    prompt: "Do something"
    environment: { ... }
    artifacts: { ... }
    graders: [...]
    rubric: ["criterion 1", "criterion 2"]
    constraints: { ... }

For multi-turn conversations, use turns instead of prompt:

stimuli:
  - name: recall-across-turns
    turns:
      - "Remember this code name: ALPHA-7."
      - "What was the code name I gave you?"
    graders:
      - type: output-contains
        turn: 1
        config:
          substring: "ALPHA-7"

Field	Type	Required	Description
`name`	string	Yes	Unique identifier for this stimulus
`prompt`	string	One of `prompt` or `turns`	The prompt sent to the agent. When `turns` is provided, `prompt` is synthesized automatically and any explicit value is ignored.
`turns`	string[]	One of `prompt` or `turns`	Ordered list of prompts for a multi-turn conversation. Each entry is sent sequentially to the same agent session.
`environment`	EnvironmentConfig	No	Stimulus-specific environment (merged with root)
`artifacts`	ArtifactsConfig	No	File copy filters applied before workspace cleanup to capture output artifacts in run results.
`graders`	GraderConfig[]	No	Graders to run on the trajectory
`rubric`	string[]	No	Evaluation criteria for LLM judge graders. Passed to `prompt` and `panel` graders as context for their judgment, and used by `vally compare` for comparisons (overrideable with `--eval-spec`).
`constraints`	Constraints	No	Resource limits for the trial
`simulation`	Simulation	No	Tool-call simulation: canned responses for specific tools (deterministic, side-effect-free evals)
`supported_executors`	string[]	No	Allow-list of executor IDs (e.g. `copilot-sdk`) this stimulus can run under. When the active executor isn’t listed, the stimulus is skipped (recorded with status `skipped`, not run or failed). Omit to run under any executor. An empty list is rejected at load time.
`golden_patch`	GoldenPatch	No	Reference-solution diff for oracle grading
`tags`	`Record<string, string \| string[]>`	No	Key-value pairs for filtering into suites

Artifacts

artifacts:
  include:
    - "**/Cargo.toml"
    - "**/src/**/*.rs"
  exclude:
    - "**/target/**"

Field	Type	Required	Description
`include`	string[]	Yes	Non-empty list of path patterns to copy.
`exclude`	string[]	No	Path patterns to remove from the include matches.

Precedence: if a path matches both include and exclude, it is excluded.

Pattern semantics:

Pattern matching uses picomatch globs (supports *, ?, **, {a,b}, and [ab]).
Patterns without glob tokens are treated as exact file or directory matches. Example: foo.txt matches only that file, while dir matches files under dir/ recursively.
Copying artifacts recursively scans the workspace to evaluate include/exclude rules, so this option can add noticeable overhead in very large repositories.
.git is always excluded regardless of include patterns, even if you glob **/*.

Destination layout:

For vally eval, copied files are placed in an artifacts/ subdirectory of the trial’s session-log directory — <runDir>/<eval>/<stimulus>/<model>/<trial>/artifacts/ — so a trial’s logs (events.jsonl/metadata.json) and its copied artifacts sit together. Each copied file keeps its workspace-relative path under artifacts/.

vally experiment co-locates artifacts the same way, under each variant’s per-trial session directory: <runDir>/<variant>/<eval>/<stimulus>/<model>/<trial>/artifacts/.

Golden patch

A golden patch is a unified-diff reference solution for the stimulus. It powers vally oracle, which materializes the stimulus’s starting environment, applies the patch, and runs the graders against the result — so you can verify the harness (graders should pass on the correct answer) and gate CI before publishing a benchmark.

Provide exactly one of inline or path:

stimuli:
  # Patch stored in a file (resolved relative to the eval file's directory)
  - name: fix-the-bug
    prompt: "Fix the off-by-one error in pagination"
    golden_patch:
      path: solutions/fix-the-bug.patch

  # Patch embedded inline
  - name: add-readme
    prompt: "Add a README"
    golden_patch:
      inline: |
        --- /dev/null
        +++ b/README.md
        @@ -0,0 +1 @@
        +# My Project

Field	Type	Required	Description
`inline`	string	One of the two	The unified-diff patch text, embedded directly in the spec
`path`	string	One of the two	Path to a patch file, resolved relative to the eval file’s directory

The resolved diff is passed to graders — and surfaced to the prompt LLM judge as a reference solution — only under vally oracle and vally grade. A plain vally eval run does not thread the golden patch into grading, so adding golden_patch alone gives the judge no reference solution during a normal eval.

Environment

Environment Fields

environment:
  skills:
    - ./path/to/my-skill # Skill directory (containing SKILL.md)
  files:
    - src: fixtures/input.txt
      dest: input.txt
    - src: fixtures/test-data # directories are copied recursively
      dest: test-data
  commands:
    - npm install
  commandTimeout: 2m
  git:
    type: worktree
    ref: v2.1.0
    source: ../my-repo
  mcpServers:
    db:
      type: stdio
      command: db-serve
      args: ["--port", "5432"]
    api:
      type: http
      url: http://localhost:3000/mcp

Field	Type	Description
`commands`	`string[]`	Shell commands to run during setup (uses `/bin/sh` on Unix, `cmd.exe` on Windows)
`commandTimeout`	`duration`	Per-command timeout for `commands` (e.g. `2m`, `90s`). Must be positive (`0` is rejected). Defaults to `60s`
`files`	`{src, dest}[]`	Files or directories to copy into the workspace before execution
`git`	`object`	Git configuration for fixture data — local worktree or remote clone (see below)
`mcpServers`	`Record<string, McpServerConfig>`	Named MCP servers to start or connect to
`skills`	`string[]`	Paths to skill directories (each containing a `SKILL.md`) to load

Git config

The git field sets up the evaluation workspace from a Git repository. It has two modes, discriminated by type:

worktree — check out a ref from a local repository as a detached worktree.
clone — clone a remote repository at a given ref into the workspace.

`type: worktree`

environment:
  git:
    type: worktree
    ref: v2.1.0
    source: ../my-repo
  commands:
    - dotnet restore

Field	Type	Required	Description
`type`	`"worktree"`	Yes	Must be `"worktree"`
`ref`	string	Yes	A commit-ish value (tag, commit SHA, or branch name) to check out
`source`	string	Yes	Path to the local repo used as the worktree source

`type: clone`

environment:
  git:
    type: clone
    url: https://github.com/octocat/hello.git
    ref: v2.1.0 # optional; defaults to the remote's default branch
    shallow: true # optional; true → depth 1, or an explicit integer depth
    sparse: # optional; cone-mode paths to materialize
      - src/core

Field	Type	Required	Description
`type`	`"clone"`	Yes	Must be `"clone"`
`url`	string	Yes	Remote repository URL to clone: http(s), ssh, git, `file://`, or scp-style `git@host:path`
`ref`	string	No	Commit-ish to check out; defaults to the remote’s default branch
`shallow`	boolean \| integer	No	Shallow-clone history: `true` fetches depth 1, an integer sets an explicit depth
`sparse`	string[]	No	Sparse-checkout (cone mode) paths — only these directories are materialized

MCP server config

Each entry in mcpServers is either a stdio server (launched as a child process) or a remote server (connected over HTTP/SSE).

Stdio (child process)

mcpServers:
  db:
    type: stdio
    command: db-serve
    args: ["--port", "5432"]
    env:
      DB_HOST: localhost
    cwd: ./services/db
    timeout: 5000

Field	Type	Required	Description
`type`	`"stdio"`	Yes	Launch as a child process
`command`	string	Yes	Executable to run
`args`	string[]	No	Arguments passed to the command
`env`	`Record<string, string>`	No	Extra environment variables for the child process
`cwd`	string	No	Working directory for the child process
`timeout`	number	No	Timeout in milliseconds for connecting to / invoking the server

Remote (HTTP/SSE)

mcpServers:
  api:
    type: http # or "sse"
    url: http://localhost:3000/mcp
    headers:
      Authorization: "Bearer ${API_TOKEN}"
    timeout: 10000

Field	Type	Required	Description
`type`	`"http"` \| `"sse"`	Yes	Connect to a remote server
`url`	string	Yes	Server endpoint URL
`headers`	`Record<string, string>`	No	Extra HTTP headers (e.g. auth tokens)
`timeout`	number	No	Timeout in milliseconds for connecting to / invoking the server

Grader config

graders:
  - type: output-contains
    name: "output includes hello"
    config:
      substring: "hello"
      case_sensitive: false

Field	Type	Required	Description
`type`	string	Yes	Registered grader name (e.g., `output-contains`, `file-exists`)
`name`	string	No	Human-readable display name for this grader instance. Defaults to `type` when omitted.
`turn`	integer	No	Scope this grader to a specific conversation turn (0-based). The pipeline slices the trajectory to that turn before grading. Only valid for non-workspace graders.
`scope`	string \| object	No	Scope this grader to the parent agent or a specific subagent. See Subagent scope. Only valid for non-workspace graders.
`config`	object	No	Grader-specific configuration (varies by type)

Subagent scope

scope restricts a grader to one agent’s slice of the trajectory, so you can assert behavior of a subagent separately from the parent (root) agent. It composes with turn and, like turn, works generically for any non-workspace grader — the pipeline slices the trajectory before grading.

`scope` value	Selects
`parent`	Only the parent/root agent (events with no subagent identity)
`subagent`	Any subagent (regardless of which)
`{ agent: "<name/id>" }`	Events whose agent identity equals the given value

Agent identity comes from the adapter’s own labeling, which is not guaranteed to be a unique instance id. So { agent } matches all events sharing that identity: if a parent fans out several same-named subagents, they’re selected together as one aggregate slice, not addressed individually.

graders:
  # The parent delegated the edit; assert it did not edit directly.
  - type: tool-calls
    scope: parent
    config:
      disallowed: [edit]
  # Assert the delegated subagent actually performed the edit.
  - type: tool-calls
    scope: subagent
    config:
      required: [edit]

Per-adapter identity

Per-call agent identity comes from the trajectory adapter, and its meaning — and therefore whether { agent } is usable — differs by source:

Copilot stamps each event with the subagent agentId from the SDK envelope. scope: parent and scope: subagent work reliably. That id is the opaque launching tool-call id (e.g. toolu_01Biq…), not a friendly agent name, so a { agent } value can’t be authored ahead of a run and isn’t portable across runs — prefer parent / subagent on Copilot.
ATIF trajectories support subagent scoping when the document inlines its subagent trajectories (subagent_trajectories linked from a step’s subagent_trajectory_ref): the adapter inlines those events stamped with the subagent’s declared agent name, so { agent: "<name>" } is authorable. ATIF subagents referenced only as external files (by session_id / trajectory_path, not inlined) carry no events in the document, so scope: subagent / { agent } selects nothing for those.

When a scope (or scope + turn) selects no events, the grader fails loud with the list of subagent ids that are present (or a note that none were captured), rather than passing vacuously.

Constraints

constraints:
  max_turns: 10
  max_tokens: 5000
  max_duration: 1m
  max_agent_duration: 45s
  expect_tools: ["write_file", "read_file"]
  reject_tools: ["delete_file"]
  expect_skills: ["my-skill"]
  reject_skills: []

Field	Type	Description
`max_turns`	number	Maximum conversation turns
`max_tokens`	number	Maximum total tokens
`max_duration`	duration	Hard wall-clock cap on the agent’s run (e.g. `2m`, `30s`, `500ms`). Unit suffix required. It bounds the in-flight agent work only — workspace setup and teardown/disconnect are bounded separately, not by this cap. Exceeding it aborts the run and fails the eval.
`max_agent_duration`	duration	Agent working-time limit (e.g. `10m`, `90s`). The agent gets up to this long to work; on expiry it is stopped and the trial proceeds through normal teardown, grading whatever was produced as a success.
`expect_tools`	string[]	Tools the agent must call
`reject_tools`	string[]	Tools the agent must not call
`expect_skills`	string[]	Skills that must be activated
`reject_skills`	string[]	Skills that must not be activated

max_duration vs max_agent_duration — two different ways to limit run time:

max_duration is a hard ceiling on the agent’s wall-clock run. Overrun it and the run is aborted and fails. It bounds the in-flight agent work only — workspace setup and teardown/disconnect are bounded separately, not by this cap.
max_agent_duration caps only the agent’s working time: “work for up to 10 minutes, then we’ll stop you and grade whatever you finished.” On expiry the agent is stopped and the trial runs through normal teardown, graded as a success.

A few details:

The working-limit covers the whole stimulus run (all turns), not a single turn.
Both limits stop the agent the same way under the hood (the executors abort the in-flight agent run — the Copilot SDK and Claude CLI offer no gentler mechanism). The difference is the verdict: max_duration fails the run, while max_agent_duration keeps it a success and grades the partial work.
When the working-limit stops a run, its trajectory records endReason: "agent_timeout" (vs "completed" for a normal finish) so graders and the analytics server can account for it.
Setting only max_agent_duration is enough — the run is allowed to last the full working-limit.
If you set both, keep the working-limit below the hard cap. The effective hard cap is max_duration (or, if unset, the eval’s defaults.timeout, then the --timeout override). When the working-limit is >= that cap, the hard cap fails the run before the working-limit can stop it cleanly. vally lint warns when it can detect this.

Simulation

Run the agent against a deterministic, mocked toolset instead of the real environment. Useful for loop/repetition detection, failure-mode recovery tests, and behavioral evals that must not cause real side effects. Requires a simulation-capable executor (copilot-sdk); other executors reject a stimulus that declares simulation.

Scope: first-party tools only (MCP server tools are not simulated).

stimuli:
  - name: npm-registry-down
    prompt: "Install the project dependencies"
    simulation:
      max_iterations: 20
      tool_overrides:
        # Shell tools can simulate success OR failure, and support ordered
        # regex patterns matched against the command (first match wins).
        bash:
          patterns:
            - match: "npm install|npm ci"
              output: |-
                npm error code ETIMEDOUT
                npm error network request to registry.npmjs.org failed
              is_error: true
            - match: "node -v"
              output: "v14.21.3" # successful stdout (is_error defaults to false)
        # Non-shell tools can only simulate failures.
        web_fetch:
          output: "Error: HTTP 403 Forbidden"
          is_error: true

Field	Type	Required	Description
`tool_overrides`	`Record<string, Override>`	No	Per-tool canned responses, keyed by tool name (`bash`, `web_fetch`, …).
`max_iterations`	number	No	Cap on model iterations (assistant turns). Exceeding it stops the run cleanly with `endReason: simulation_cap`; this is not a run error — graders still run on the collected trajectory.

Each override value is one of:

string — shorthand for successful stdout ({ output: <str>, is_error: false }). Because it resolves to success, this form is only valid for shell tools (see Shell vs non-shell below); non-shell tools must use { output, is_error: true }.
object — { output: string, is_error?: boolean }. is_error defaults to false.
patterns — { patterns: [{ match, output, is_error? }], default? } for input-dependent shell tools. match is a case-insensitive regex tested against the command string; the first match wins. If nothing matches, default is used, or — when omitted — the real tool runs.

Shell vs non-shell: shell tools (bash, shell, powershell, treated as aliases) can simulate a successful or failed result because the command is rewritten to emit the canned output. All other first-party tools can only simulate failures — a non-shell override that resolves to success (e.g. a plain string, or is_error: false) is rejected by vally lint. Pattern overrides are shell-only.

Recording: simulated tool calls are flagged in the trajectory (simulated: true) and counted in metrics (simulatedToolCallCount), so graders and reports can tell simulated calls apart from real ones.

Scoring

scoring:
  weights:
    file-exists: 0.7
    output-contains: 0.3
  threshold: 0.7

Field	Type	Description
`weights`	`Record<string, number>`	Weight per grader type. Must be a normalized distribution (values sum to 1.0 ±0.01). When multiple graders share the same type, their scores are averaged before the weight is applied. Graders absent from the map receive weight 0. If omitted, all graders are averaged equally.
`threshold`	number	Optional. Minimum aggregate score to pass (0.0–1.0). When omitted, the verdict falls back to binary all-graders-pass and `scoring.weights` has no effect on the verdict. Can be overridden per-run with `vally eval --threshold <number>`.

Complete example

name: test-writer-eval
description: Evaluates the test-writer skill
version: "1.0"
type: capability

environment:
  skills:
    - ./my-skill # Skill directory (containing SKILL.md)

defaults:
  runs: 3
  timeout: 2m
  model: gpt-5.5
  executor: copilot-sdk

stimuli:
  - name: basic-test-generation
    prompt: |
      Write unit tests for this function:
      function add(a, b) { return a + b; }
      Save the tests to add.test.js.
    graders:
      - type: file-exists
        name: "test file was created"
        config:
          path: "add.test.js"
      - type: output-contains
        config:
          substring: "test"
    constraints:
      max_turns: 10
      expect_tools: ["write_file"]

  - name: edge-case-empty-input
    prompt: "Write tests for a function that takes no arguments."
    graders:
      - type: file-exists
        config:
          path: "*.test.js"

scoring:
  weights:
    file-exists: 0.7
    output-contains: 0.3
  threshold: 0.7