
Writing Eval Specs

An eval spec (eval.yaml or eval.yml) defines what to test and how to grade it. This guide walks through the format with annotated examples.

Before writing stimuli, you can validate your eval spec for typos and misconfigurations:

vally lint --eval-spec eval.yaml

This catches common mistakes instantly: unknown grader types (with “did you mean?” suggestions), invalid config keys, scoring weight mismatches, and more. See the lint reference for the full list of checks.

Validation also runs automatically when you use eval.
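
As a rough illustration of how a “did you mean?” suggestion can be produced, the sketch below uses Python's standard difflib module to find the closest known grader type. This is not vally's implementation: the function name suggest_grader_type is hypothetical, and KNOWN_GRADER_TYPES only lists the grader types that appear in this guide.

```python
import difflib
from typing import Optional

# Assumption: only the grader types shown in this guide; the real registry may differ.
KNOWN_GRADER_TYPES = ["output-contains", "file-exists", "run-command", "prompt", "pairwise"]


def suggest_grader_type(unknown: str) -> Optional[str]:
    """Return the closest known grader type, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(unknown, KNOWN_GRADER_TYPES, n=1, cutoff=0.6)
    return matches[0] if matches else None


# A misspelled grader type resolves to its nearest known neighbour.
suggestion = suggest_grader_type("output-contain")
```

A linter built this way would report the unknown type together with the suggestion, and report only the error when no candidate is close enough.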

The simplest possible eval spec:

eval.yaml
config:
  runs: 1
stimuli:
  - name: hello-world
    prompt: "Say hello"
    graders:
      - type: output-contains
        config:
          substring: "hello"

This defines one stimulus with one grader. The agent will be prompted with “Say hello”, and the grader will check if the output contains “hello”.

eval.yaml
name: my-skill-eval # Optional: human-readable name
description: Evaluates X # Optional: what this eval tests
version: "1.0" # Optional: version tracking
type: capability # capability | regression
# Root-level environment (merged into all stimuli)
environment:
  skills:
    - ./path/to/SKILL.md
  files:
    - src: fixtures/input.txt
      dest: input.txt
  commands:
    - npm install
# Execution configuration
config:
  runs: 3 # Trials per stimulus
  timeout: 120 # Seconds per trial
  model: gpt-5.5 # Model for agent execution
  executor: copilot-sdk # Which executor to use
# Test cases
stimuli:
  - name: basic-usage
    prompt: |
      Use the skill to generate unit tests for the
      function in input.txt.
    environment: # Stimulus-level env (merged with root)
      files:
        - src: fixtures/basic.txt
          dest: input.txt
    graders:
      - type: file-exists
        config:
          path: "*.test.js"
      - type: output-contains
        config:
          substring: "test"
    constraints:
      max_turns: 10
      max_tokens: 5000
      expect_tools: ["write_file"]
  - name: edge-case
    prompt: "Handle an empty input file."
    graders:
      - type: output-contains
        config:
          substring: "empty"
# How to aggregate scores
scoring:
  weights:
    file-exists: 1.0
    output-contains: 0.5
  threshold: 0.7

Environments configure the workspace before the agent runs.

The root-level environment is applied to all stimuli. Define shared setup here:

environment:
  skills:
    - ./skills/my-skill/SKILL.md # Skills to load
  files:
    - src: fixtures/shared.txt # Files to copy into workspace
      dest: shared.txt
    - src: fixtures/test-data # Directories are copied recursively
      dest: test-data
  commands:
    - npm install # Commands to run in workspace

A stimulus-level environment adds per-stimulus setup. It is merged with the root environment (arrays are concatenated, not replaced):

stimuli:
  - name: test-with-config
    environment:
      files:
        - src: fixtures/config.json
          dest: config.json # Added to root files
    prompt: "Use the config file..."
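
To make the merge rule concrete, here is a minimal Python sketch of "arrays are concatenated, not replaced". It is an illustration only, not vally's code: merge_environments is a hypothetical helper, and placing root entries before stimulus entries is an assumption.

```python
from typing import Any, Dict, List

Environment = Dict[str, List[Any]]


def merge_environments(root: Environment, stimulus: Environment) -> Environment:
    """Concatenate each list in the stimulus environment onto the root's list."""
    merged = {key: list(value) for key, value in root.items()}
    for key, value in stimulus.items():
        # Assumption: root entries come first, stimulus entries are appended.
        merged[key] = merged.get(key, []) + list(value)
    return merged


root_env = {
    "skills": ["./skills/my-skill/SKILL.md"],
    "files": [{"src": "fixtures/shared.txt", "dest": "shared.txt"}],
    "commands": ["npm install"],
}
stimulus_env = {
    "files": [{"src": "fixtures/config.json", "dest": "config.json"}],
}

effective = merge_environments(root_env, stimulus_env)
```

In this sketch, effective["files"] holds both shared.txt and config.json, while skills and commands are inherited unchanged from the root.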

Instead of defining environments inline in each eval, you can define shared environments in .vally.yaml and reference them by name:

.vally.yaml
environments:
  auth-workspace:
    skills: [skills/auth]
    files:
      - src: fixtures/auth-data.json
        dest: test-data.json
    commands:
      - npm install

Then reference in your eval spec:

evals/auth/eval.yaml
environment: auth-workspace
stimuli:
  - name: login-test
    prompt: "Test the login flow"

This keeps your eval specs DRY and makes it easy to update shared setup across multiple evals. See the .vally.yaml reference for the full environment schema.

Each grader in a stimulus specifies a type (matching a registered grader name) and optional config:

graders:
  - type: output-contains
    config:
      substring: "success"
      case_sensitive: false # Grader-specific option
  - type: file-exists
    config:
      path: "output/*.txt" # Supports glob patterns
  - type: run-command
    config:
      command: "node validate.js"
      expected_exit_code: 0
  - type: prompt
    config:
      prompt: "Are the tests comprehensive and well-structured?"
      scoring: scale_1_5

See the Grader catalog for all built-in graders and their config options, including the prompt and pairwise LLM judges.
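
The following Python sketch shows roughly what the output-contains and file-exists checks above amount to. It is a simplified model, not the built-in graders: the function names are hypothetical, and the default of case_sensitive=True is an assumption (check the Grader catalog for the real default).

```python
import tempfile
from pathlib import Path


def output_contains(output: str, substring: str, case_sensitive: bool = True) -> bool:
    """Substring check; the case_sensitive default here is an assumption."""
    if case_sensitive:
        return substring in output
    return substring.lower() in output.lower()


def file_exists(workspace: Path, pattern: str) -> bool:
    """True if at least one path in the workspace matches the glob pattern."""
    return any(workspace.glob(pattern))


# Build a throwaway workspace containing output/report.txt.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    (workspace / "output").mkdir()
    (workspace / "output" / "report.txt").write_text("done")
    found_txt = file_exists(workspace, "output/*.txt")
    found_json = file_exists(workspace, "output/*.json")
```

With case_sensitive: false, an output of "Success!" satisfies substring: "success". A case-sensitive comparison would not match it.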

Use a rubric on the stimulus alongside a prompt grader to give the LLM judge evaluation criteria:

config:
  judge_model: claude-sonnet-4.6 # default model for all LLM graders
stimuli:
  - name: code-quality
    prompt: "Refactor this function for readability"
    rubric:
      - "Code is well-structured with clear variable names"
      - "Comments explain non-obvious logic"
      - "No dead code or redundant operations"
    graders:
      - type: prompt
        config:
          prompt: "Evaluate the refactored code against the rubric criteria"
          scoring: scale_1_5

Add pairwise graders for A/B comparisons (these only run via vally compare):

stimuli:
  - name: test-generation
    prompt: "Write comprehensive tests"
    graders:
      - type: file-exists
        config: { path: "*.test.js" }
      - type: pairwise
        config:
          prompt: "Which set of tests has better coverage?"

Constraints limit what the agent can do during a trial:

constraints:
  max_turns: 10 # Maximum conversation turns
  max_tokens: 5000 # Maximum total tokens
  max_duration: 1m # Maximum wall time
  expect_tools: ["write_file"] # Agent MUST call these tools
  reject_tools: ["delete_file"] # Agent must NOT call these tools
  expect_skills: ["my-skill"] # These skills must be activated
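
As a sketch of how such constraints could be checked after a trial, the Python below compares a trial summary against the limits and tool lists. Everything here is hypothetical: TrialSummary and constraint_violations are illustrative names, and whether vally enforces limits during the trial or checks them afterwards is not specified in this guide.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class TrialSummary:
    """Hypothetical record of what happened during one trial."""
    turns: int
    total_tokens: int
    tools_called: Set[str] = field(default_factory=set)


def constraint_violations(
    trial: TrialSummary,
    max_turns: int,
    max_tokens: int,
    expect_tools: List[str],
    reject_tools: List[str],
) -> List[str]:
    """Return a human-readable message for every violated constraint."""
    violations = []
    if trial.turns > max_turns:
        violations.append(f"max_turns exceeded: {trial.turns} > {max_turns}")
    if trial.total_tokens > max_tokens:
        violations.append(f"max_tokens exceeded: {trial.total_tokens} > {max_tokens}")
    for tool in sorted(set(expect_tools) - trial.tools_called):
        violations.append(f"expected tool not called: {tool}")
    for tool in sorted(set(reject_tools) & trial.tools_called):
        violations.append(f"rejected tool was called: {tool}")
    return violations


trial = TrialSummary(turns=4, total_tokens=3200, tools_called={"read_file", "delete_file"})
violations = constraint_violations(trial, 10, 5000, ["write_file"], ["delete_file"])
```

Here the trial stays within its turn and token budgets but still violates two constraints: it never called write_file, and it called delete_file.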

Scoring controls how grader scores are aggregated into a pass/fail result:

scoring:
  weights:
    file-exists: 1.0 # Weight for each grader type
    output-contains: 0.5
  threshold: 0.7 # Minimum aggregate score to pass

If weights is omitted, all graders are weighted equally. See Scoring for how pass@k and pass^k are computed.
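
The exact aggregation formula is not stated here, so the sketch below assumes a weighted average of per-grader scores in the range 0 to 1, compared against the threshold with >=. The helper name aggregate_score and the default weight of 1.0 for unlisted grader types are also assumptions. See Scoring for the authoritative definition.

```python
from typing import Dict, List, Optional, Tuple


def aggregate_score(
    results: List[Tuple[str, float]],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted average of (grader_type, score) pairs; equal weights if none are given."""
    weights = weights or {}
    total_weight = 0.0
    weighted_sum = 0.0
    for grader_type, score in results:
        # Assumption: grader types without an explicit weight count as 1.0.
        weight = weights.get(grader_type, 1.0)
        total_weight += weight
        weighted_sum += weight * score
    return weighted_sum / total_weight if total_weight else 0.0


weights = {"file-exists": 1.0, "output-contains": 0.5}
threshold = 0.7

# file-exists passed (1.0), output-contains failed (0.0).
results = [("file-exists", 1.0), ("output-contains", 0.0)]
score = aggregate_score(results, weights)
passed = score >= threshold
```

Under these assumptions the trial scores 1.0 / 1.5, roughly 0.67, which falls just short of the 0.7 threshold even though the higher-weighted grader passed.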

Multiple stimuli testing the same capability

stimuli:
  - name: simple-case
    prompt: "Generate tests for function add(a,b)"
    graders: [{ type: file-exists, config: { path: "*.test.js" } }]
  - name: complex-case
    prompt: "Generate tests for an async API client"
    graders: [{ type: file-exists, config: { path: "*.test.js" } }]
  - name: edge-case
    prompt: "Generate tests for a function with no arguments"
    graders: [{ type: file-exists, config: { path: "*.test.js" } }]

Use type: regression to signal this eval is checking for regressions, not new capabilities:

type: regression
config:
  runs: 5 # More trials for statistical confidence
scoring:
  threshold: 0.9 # Higher bar for regressions