Skip to content

Eval Spec Reference

The eval.yaml file defines what to test and how to grade it. This page documents every field.

FieldTypeRequiredDescription
namestringNoHuman-readable name for this eval
descriptionstringNoWhat this eval tests
versionstringNoVersion for tracking
type"capability" | "regression"NoEval intent
environmentEnvironmentConfigNoRoot environment (merged into all stimuli)
configEvalConfigNoExecution configuration
stimuliStimulus[]YesTest cases (at least one required)
scoringScoringConfigNoScore aggregation settings
config:
runs: 3 # Trials per stimulus (default: 1)
timeout: 2m # Per trial (e.g. 5m, 300s, 30000ms); unit suffix required (default: 2m)
model: gpt-5.5 # Model for agent execution
judge_model: gpt-5.5 # Default model for LLM graders
executor: copilot-sdk # or custom executor via --executor-plugin
FieldTypeDefaultDescription
runsnumber1Number of trials per stimulus. 0 = lint-only (skip execution). 1 = single run. ≥ 2 = multi-trial with pass@k/pass^k aggregation.
timeoutduration2mTime before a trial times out. Requires a unit suffix (e.g. 5m, 300s, 30000ms).
modelstringModel identifier for the executor
judge_modelstringDefault model for LLM judge graders (prompt and pairwise). Graders can override this in their own config.
executorstringName of the executor to use for this eval. A spec selects a single executor. Built-in: copilot-sdk. Additional executors can be registered by passing --executor-plugin on the CLI, then referenced here by name. The flag is repeatable so a single vally eval invocation can run multiple specs that select different executors.
stimuli:
- name: my-test-case
prompt: "Do something"
environment: { ... }
graders: [...]
rubric: ["criterion 1", "criterion 2"]
constraints: { ... }
FieldTypeRequiredDescription
namestringYesUnique identifier for this stimulus
promptstringYesThe prompt sent to the agent
environmentEnvironmentConfigNoStimulus-specific environment (merged with root)
gradersGraderConfig[]NoGraders to run on the trajectory
rubricstring[]NoEvaluation criteria for LLM judge graders. Passed to prompt and pairwise graders as context for their judgment.
constraintsConstraintsNoResource limits for the trial
tagsRecord<string, string | string[]>NoKey-value pairs for filtering into suites

Tags are key-value pairs used for filtering stimuli into suites. They can be defined at the eval level (inherited by all stimuli) or the stimulus level (overrides eval-level tags on the same key).

tags:
priority: p0
area: [auth, security]

Filtering semantics:

  • AND across keys — a stimulus must match all specified tag keys
  • OR within values — a stimulus matches a key if it has any of the specified values
  • Stimulus tags override eval tags on the same key

See Authoring Eval Suites for tagging strategies and suite configuration.

Environment Fields

environment:
skills:
- ./path/to/SKILL.md
files:
- src: fixtures/input.txt
dest: input.txt
- src: fixtures/test-data # directories are copied recursively
dest: test-data
commands:
- npm install
git:
type: worktree
ref: v2.1.0
source: ../my-repo
mcpServers:
db:
type: stdio
command: db-serve
args: ["--port", "5432"]
api:
type: http
url: http://localhost:3000/mcp
FieldTypeDescription
commandsstring[]Shell commands to run during setup
files{src, dest}[]Files or directories to copy into the workspace before execution
gitobjectGit worktree configuration for fixture data (see below)
mcpServersRecord<string, McpServerConfig>Named MCP servers to start or connect to
skillsstring[]Paths to SKILL.md files to load

Git config

The git field sets up a Git worktree as the evaluation workspace, checking out a specific commit, tag, or branch from a local repository.

environment:
git:
type: worktree
ref: v2.1.0
source: ../my-repo
commands:
- dotnet restore
FieldTypeRequiredDescription
type"worktree"YesMust be "worktree"
refstringYesA commit-ish value (tag, commit SHA, or branch name) to check out
sourcestringYesPath to the local repo used as the worktree source

MCP server config

Each entry in mcpServers is either a stdio server (launched as a child process) or a remote server (connected over HTTP/SSE).

Stdio (child process)

mcpServers:
db:
type: stdio
command: db-serve
args: ["--port", "5432"]
env:
DB_HOST: localhost
cwd: ./services/db
timeout: 5000
FieldTypeRequiredDescription
type"stdio"YesLaunch as a child process
commandstringYesExecutable to run
argsstring[]NoArguments passed to the command
envRecord<string, string>NoExtra environment variables for the child process
cwdstringNoWorking directory for the child process
timeoutnumberNoTimeout in milliseconds for connecting to / invoking the server

Remote (HTTP/SSE)

mcpServers:
api:
type: http # or "sse"
url: http://localhost:3000/mcp
headers:
Authorization: "Bearer ${API_TOKEN}"
timeout: 10000
FieldTypeRequiredDescription
type"http" | "sse"YesConnect to a remote server
urlstringYesServer endpoint URL
headersRecord<string, string>NoExtra HTTP headers (e.g. auth tokens)
timeoutnumberNoTimeout in milliseconds for connecting to / invoking the server
graders:
- type: output-contains
name: "output includes hello"
config:
substring: "hello"
case_sensitive: false
FieldTypeRequiredDescription
typestringYesRegistered grader name (e.g., output-contains, file-exists)
namestringNoHuman-readable display name for this grader instance. Defaults to type when omitted.
configobjectNoGrader-specific configuration (varies by type)
constraints:
max_turns: 10
max_tokens: 5000
max_duration: 1m
expect_tools: ["write_file", "read_file"]
reject_tools: ["delete_file"]
expect_skills: ["my-skill"]
reject_skills: []
FieldTypeDescription
max_turnsnumberMaximum conversation turns
max_tokensnumberMaximum total tokens
max_durationdurationMaximum wall-clock time (e.g. 2m, 30s, 500ms). Unit suffix required.
expect_toolsstring[]Tools the agent must call
reject_toolsstring[]Tools the agent must not call
expect_skillsstring[]Skills that must be activated
reject_skillsstring[]Skills that must not be activated
scoring:
weights:
file-exists: 1.0
output-contains: 0.5
threshold: 0.7
FieldTypeDescription
weightsRecord<string, number>Weight per grader type. If omitted, all graders weighted equally.
thresholdnumberMinimum aggregate score to pass (0.0–1.0)
eval.yaml
name: test-writer-eval
description: Evaluates the test-writer skill
version: "1.0"
type: capability
environment:
skills:
- ./SKILL.md
config:
runs: 3
timeout: 2m
model: gpt-5.5
executor: copilot-sdk
stimuli:
- name: basic-test-generation
prompt: |
Write unit tests for this function:
function add(a, b) { return a + b; }
Save the tests to add.test.js.
graders:
- type: file-exists
name: "test file was created"
config:
path: "add.test.js"
- type: output-contains
config:
substring: "test"
constraints:
max_turns: 10
expect_tools: ["write_file"]
- name: edge-case-empty-input
prompt: "Write tests for a function that takes no arguments."
graders:
- type: file-exists
config:
path: "*.test.js"
scoring:
weights:
file-exists: 1.0
output-contains: 0.5
threshold: 0.7