Writing Eval Specs
An eval spec (eval.yaml or eval.yml) defines what to test and how to grade it. This guide walks through the format with annotated examples.
Validate before running
Before writing stimuli, you can validate your eval spec for typos and misconfigurations:
    vally lint --eval-spec eval.yaml

This catches common mistakes instantly: unknown grader types (with “did you mean?” suggestions), invalid config keys, scoring weight mismatches, and more. See the lint reference for the full list of checks.
Validation also runs automatically when you use eval.
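As a rough illustration of what a “did you mean?” check involves, the Python sketch below compares an unknown grader type against a list of known names. It is not vally's implementation: `KNOWN_GRADERS` only lists the grader types mentioned in this guide, and `suggest_grader` and the 0.6 similarity cutoff are hypothetical.

```python
import difflib
from typing import Optional

# Grader types named in this guide; the real registry may contain more.
KNOWN_GRADERS = ["output-contains", "file-exists", "run-command", "prompt", "pairwise"]

def suggest_grader(name: str, cutoff: float = 0.6) -> Optional[str]:
    """Return the closest known grader type, or None if nothing is similar enough."""
    if name in KNOWN_GRADERS:
        return name
    # get_close_matches ranks candidates by similarity ratio, best first.
    matches = difflib.get_close_matches(name, KNOWN_GRADERS, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With this sketch, a misspelling such as `output-contain` resolves to `output-contains`, while an unrelated string yields no suggestion.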
Minimal eval spec
The simplest possible eval spec:
    config:
      runs: 1

    stimuli:
      - name: hello-world
        prompt: "Say hello"
        graders:
          - type: output-contains
            config:
              substring: "hello"

This defines one stimulus with one grader. The agent is prompted with “Say hello”, and the grader checks whether the output contains “hello”.
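The check performed by the output-contains grader can be pictured as a plain substring test. The Python sketch below is an assumption about its behavior, not the grader's source: `output_contains` is a hypothetical helper, and the case-sensitive default is assumed (the `case_sensitive` option appears in the grader configuration example in this guide, but its default is not stated here).

```python
def output_contains(output: str, substring: str, case_sensitive: bool = True) -> bool:
    """Hypothetical stand-in for the output-contains check."""
    if not case_sensitive:
        return substring.lower() in output.lower()
    return substring in output

print(output_contains("Hello there!", "hello"))                        # False
print(output_contains("Hello there!", "hello", case_sensitive=False))  # True
```

If the real grader is case-sensitive by default, an agent that replies “Hello!” would fail the minimal spec, so it may be worth setting `case_sensitive: false` for checks like this one.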
Full structure
    name: my-skill-eval        # Optional: human-readable name
    description: Evaluates X   # Optional: what this eval tests
    version: "1.0"             # Optional: version tracking
    type: capability           # capability | regression

    # Root-level environment (merged into all stimuli)
    environment:
      skills:
        - ./path/to/SKILL.md
      files:
        - src: fixtures/input.txt
          dest: input.txt
      commands:
        - npm install

    # Execution configuration
    config:
      runs: 3                  # Trials per stimulus
      timeout: 120             # Seconds per trial
      model: gpt-5.5           # Model for agent execution
      executor: copilot-sdk    # Which executor to use

    # Test cases
    stimuli:
      - name: basic-usage
        prompt: |
          Use the skill to generate unit tests for the function in input.txt.
        environment:           # Stimulus-level env (merged with root)
          files:
            - src: fixtures/basic.txt
              dest: input.txt
        graders:
          - type: file-exists
            config:
              path: "*.test.js"
          - type: output-contains
            config:
              substring: "test"
        constraints:
          max_turns: 10
          max_tokens: 5000
          expect_tools: ["write_file"]

      - name: edge-case
        prompt: "Handle an empty input file."
        graders:
          - type: output-contains
            config:
              substring: "empty"

    # How to aggregate scores
    scoring:
      weights:
        file-exists: 1.0
        output-contains: 0.5
      threshold: 0.7

Environments
Environments configure the workspace before the agent runs.
Root environment
Applied to all stimuli. Define here what’s shared:
    environment:
      skills:
        - ./skills/my-skill/SKILL.md   # Skills to load
      files:
        - src: fixtures/shared.txt     # Files to copy into workspace
          dest: shared.txt
        - src: fixtures/test-data      # Directories are copied recursively
          dest: test-data
      commands:
        - npm install                  # Commands to run in workspace

Stimulus environment
Per-stimulus overrides. Merged with the root environment (arrays are concatenated, not replaced):
    stimuli:
      - name: test-with-config
        environment:
          files:
            - src: fixtures/config.json
              dest: config.json        # Added to root files
        prompt: "Use the config file..."

Using Named Environments
Instead of defining environments inline in each eval, you can define shared environments in .vally.yaml and reference them by name:
    environments:
      auth-workspace:
        skills: [skills/auth]
        files:
          - src: fixtures/auth-data.json
            dest: test-data.json
        commands:
          - npm install

Then reference it by name in your eval spec:
    environment: auth-workspace
    stimuli:
      - name: login-test
        prompt: "Test the login flow"

This keeps your eval specs DRY and makes it easy to update shared setup across multiple evals. See the .vally.yaml reference for the full environment schema.
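To make the merge rule for root and stimulus environments concrete (arrays are concatenated, not replaced), here is a minimal Python sketch. `merge_environments` is a hypothetical helper, not part of vally, and placing root entries before stimulus entries is an assumption.

```python
def merge_environments(root: dict, stimulus: dict) -> dict:
    """Merge two environment mappings by concatenating their list values."""
    # Copy the root lists so the root environment is never mutated.
    merged = {key: list(value) for key, value in root.items()}
    for key, value in stimulus.items():
        merged[key] = merged.get(key, []) + list(value)
    return merged

root_env = {
    "skills": ["./skills/my-skill/SKILL.md"],
    "files": [{"src": "fixtures/shared.txt", "dest": "shared.txt"}],
    "commands": ["npm install"],
}
stimulus_env = {
    "files": [{"src": "fixtures/config.json", "dest": "config.json"}],
}

effective = merge_environments(root_env, stimulus_env)
# effective["files"] now holds shared.txt followed by config.json;
# skills and commands are carried over from the root unchanged.
```

Under this reading, a stimulus can only add to the shared setup. It cannot remove a file or command that the root environment declares.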
Grader configuration
Each grader in a stimulus specifies a type (matching a registered grader name) and optional config:
    graders:
      - type: output-contains
        config:
          substring: "success"
          case_sensitive: false   # Grader-specific option

      - type: file-exists
        config:
          path: "output/*.txt"    # Supports glob patterns

      - type: run-command
        config:
          command: "node validate.js"
          expected_exit_code: 0

      - type: prompt
        config:
          prompt: "Are the tests comprehensive and well-structured?"
          scoring: scale_1_5

See the Grader catalog for all built-in graders and their config options, including the prompt and pairwise LLM judges.
Using LLM judges
Prompt grader with rubric
Use a rubric on the stimulus alongside a prompt grader to give the LLM judge evaluation criteria:
    config:
      judge_model: claude-sonnet-4.6   # Default model for all LLM graders

    stimuli:
      - name: code-quality
        prompt: "Refactor this function for readability"
        rubric:
          - "Code is well-structured with clear variable names"
          - "Comments explain non-obvious logic"
          - "No dead code or redundant operations"
        graders:
          - type: prompt
            config:
              prompt: "Evaluate the refactored code against the rubric criteria"
              scoring: scale_1_5

Pairwise comparison
Add pairwise graders for A/B comparisons (these only run via vally compare):
    stimuli:
      - name: test-generation
        prompt: "Write comprehensive tests"
        graders:
          - type: file-exists
            config: { path: "*.test.js" }
          - type: pairwise
            config:
              prompt: "Which set of tests has better coverage?"

Constraints
Constraints limit what the agent can do during a trial:
    constraints:
      max_turns: 10                   # Maximum conversation turns
      max_tokens: 5000                # Maximum total tokens
      max_duration: 1m                # Maximum wall time
      expect_tools: ["write_file"]    # Agent MUST call these tools
      reject_tools: ["delete_file"]   # Agent must NOT call these tools
      expect_skills: ["my-skill"]     # These skills must be activated

Scoring configuration
    scoring:
      weights:
        file-exists: 1.0       # Weight for each grader type
        output-contains: 0.5
      threshold: 0.7           # Minimum aggregate score to pass

If weights is omitted, all graders are weighted equally. See Scoring for how pass@k and pass^k are computed.
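As a worked example of how weights and threshold might interact, the Python sketch below assumes the aggregate is a weighted mean of per-grader scores in the range 0 to 1. That formula is an assumption made for illustration, as is the default weight of 1.0 for graders without an entry. `aggregate_score` is a hypothetical helper, and the Scoring page remains the authoritative definition.

```python
from typing import Dict, Optional

def aggregate_score(scores: Dict[str, float],
                    weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted mean of grader scores; graders without a weight default to 1.0 (assumed)."""
    weights = weights or {}
    total_weight = sum(weights.get(name, 1.0) for name in scores)
    if total_weight == 0:
        return 0.0
    weighted = sum(score * weights.get(name, 1.0) for name, score in scores.items())
    return weighted / total_weight

weights = {"file-exists": 1.0, "output-contains": 0.5}
threshold = 0.7

# file-exists passes (1.0) but output-contains fails (0.0).
score = aggregate_score({"file-exists": 1.0, "output-contains": 0.0}, weights)
print(round(score, 3), score >= threshold)   # 0.667 False
```

Under these assumptions, passing only the heavier grader gives (1.0 × 1.0 + 0.5 × 0.0) / 1.5 ≈ 0.667, which falls just short of the 0.7 threshold. With weights omitted, the same results would average to 0.5.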
Patterns and tips
Multiple stimuli testing the same capability
    stimuli:
      - name: simple-case
        prompt: "Generate tests for function add(a,b)"
        graders: [{ type: file-exists, config: { path: "*.test.js" } }]

      - name: complex-case
        prompt: "Generate tests for an async API client"
        graders: [{ type: file-exists, config: { path: "*.test.js" } }]

      - name: edge-case
        prompt: "Generate tests for a function with no arguments"
        graders: [{ type: file-exists, config: { path: "*.test.js" } }]

Regression testing
Use type: regression to signal that this eval checks for regressions, not new capabilities:
    type: regression

    config:
      runs: 5          # More trials for statistical confidence

    scoring:
      threshold: 0.9   # Higher bar for regressions

Next steps
- eval.yaml schema reference: complete field specification
- Grader catalog: all available graders
- Debugging evals: when evals fail unexpectedly