Writing Eval Specs

An eval spec (eval.yaml or eval.yml) defines what to test and how to grade it. This guide walks through the format with annotated examples.

Validate before running

Before writing stimuli, you can validate your eval spec for typos and misconfigurations:

vally lint --eval-spec eval.yaml

This instantly catches common mistakes: unknown grader types (with “did you mean?” suggestions), invalid config keys, scoring weights that do not sum to 1.0, and more. See the lint reference for the full list of checks.

Validation also runs automatically when you run vally eval.

Minimal eval spec

The simplest possible eval spec:

defaults:
  runs: 1

stimuli:
  - name: hello-world
    prompt: "Say hello"
    graders:
      - type: output-contains
        config:
          substring: "hello"

This defines one stimulus with one grader. The agent will be prompted with “Say hello”, and the grader will check if the output contains “hello”.
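
The prompt asks for “hello” in lowercase, but an agent may well answer “Hello”. This guide does not state whether output-contains is case-sensitive by default, so the variant below is only a sketch: it reuses the case_sensitive option shown under Grader configuration to make the intent explicit.

```yaml
defaults:
  runs: 1

stimuli:
  - name: hello-world
    prompt: "Say hello"
    graders:
      - type: output-contains
        config:
          substring: "hello"
          case_sensitive: false # Assumed to also accept "Hello" or "HELLO"
```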

Full structure

name: my-skill-eval # Optional: human-readable name
description: Evaluates X # Optional: what this eval tests
version: "1.0" # Optional: version tracking
type: capability # capability | regression

# Root-level environment (merged into all stimuli)
environment:
  skills:
    - ./path/to/my-skill # Skill directory (containing SKILL.md)
  files:
    - src: fixtures/input.txt
      dest: input.txt
  commands:
    - npm install

# Execution configuration
defaults:
  runs: 3 # Trials per stimulus
  timeout: 120s # Duration per trial (requires a unit suffix)
  model: gpt-5.5 # Model for agent execution
  executor: copilot-sdk # Which executor to use

# Test cases
stimuli:
  - name: basic-usage
    prompt: |
      Use the skill to generate unit tests for the
      function in input.txt.
    environment: # Stimulus-level env (merged with root)
      files:
        - src: fixtures/basic.txt
          dest: input.txt
    graders:
      - type: file-exists
        config:
          path: "*.test.js"
      - type: output-contains
        config:
          substring: "test"
    constraints:
      max_turns: 10
      max_tokens: 5000
      expect_tools: ["write_file"]

  - name: edge-case
    prompt: "Handle an empty input file."
    graders:
      - type: output-contains
        config:
          substring: "empty"

# How to aggregate scores
scoring:
  weights:
    file-exists: 0.7
    output-contains: 0.3
  threshold: 0.7

Environments

Environments configure the workspace before the agent runs.

Root environment

The root environment applies to every stimulus. Define shared setup here:

environment:
  skills:
    - ./skills/my-skill # Skill directory (containing SKILL.md)
  files:
    - src: fixtures/shared.txt # Files to copy into workspace
      dest: shared.txt
    - src: fixtures/test-data # Directories are copied recursively
      dest: test-data
  commands:
    - npm install # Commands to run in workspace (/bin/sh on Unix, cmd.exe on Windows)

Stimulus environment

A stimulus can declare its own environment, which is merged with the root environment (arrays are concatenated, not replaced):

stimuli:
  - name: test-with-config
    environment:
      files:
        - src: fixtures/config.json
          dest: config.json # Added to root files
    prompt: "Use the config file..."
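
To illustrate the merge, suppose the root environment is the one from the Root environment section. The workspace for test-with-config would then be prepared as if the stimulus had declared everything below. This is a hand-written sketch of the merged result, not tool output, and the ordering (root entries first, then stimulus entries) is an assumption.

```yaml
# Effective environment for test-with-config (illustrative only)
environment:
  skills:
    - ./skills/my-skill
  files:
    - src: fixtures/shared.txt # From the root environment
      dest: shared.txt
    - src: fixtures/test-data # From the root environment
      dest: test-data
    - src: fixtures/config.json # Added by the stimulus
      dest: config.json
  commands:
    - npm install # From the root environment
```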

Using named environments

Instead of defining environments inline in each eval, you can define shared environments in .vally.yaml and reference them by name:

environments:
  auth-workspace:
    skills: [skills/auth]
    files:
      - src: fixtures/auth-data.json
        dest: test-data.json
    commands:
      - npm install

Then reference it by name in your eval spec:

environment: auth-workspace
stimuli:
  - name: login-test
    prompt: "Test the login flow"

This keeps your eval specs DRY and makes it easy to update shared setup across multiple evals. See the .vally.yaml reference for the full environment schema.

Grader configuration

Each grader in a stimulus specifies a type (matching a registered grader name) and optional config:

graders:
  - type: output-contains
    config:
      substring: "success"
      case_sensitive: false # Grader-specific option

  - type: file-exists
    config:
      path: "output/*.txt" # Supports glob patterns

  - type: run-command
    config:
      command: "node validate.js"
      expected_exit_code: 0

  - type: prompt
    config:
      prompt: "Are the tests comprehensive and well-structured?"
      scoring: scale_1_5

See the Grader catalog for all built-in graders and their config options, including the prompt and panel LLM judges.

Using LLM judges

Prompt grader with rubric

Use a rubric on the stimulus alongside a prompt grader to give the LLM judge evaluation criteria:

defaults:
  judge_model: claude-sonnet-4.6 # default model for all LLM graders

stimuli:
  - name: code-quality
    prompt: "Refactor this function for readability"
    rubric:
      - "Code is well-structured with clear variable names"
      - "Comments explain non-obvious logic"
      - "No dead code or redundant operations"
    graders:
      - type: prompt
        config:
          prompt: "Evaluate the refactored code against the rubric criteria"
          scoring: scale_1_5

A/B comparison

Comparison is a mode of the prompt judge — any stimulus with a rubric can be compared. Run vally compare over an experiment’s output or two independent runs, and the prompt judge compares the baseline against each treatment using that rubric.

# Compare two independent runs of the same eval spec
vally compare --baseline ./results/main --treatment ./results/pr-branch

Constraints

Constraints limit what the agent can do during a trial:

constraints:
  max_turns: 10 # Maximum conversation turns
  max_tokens: 5000 # Maximum total tokens
  max_duration: 1m # Hard cap on the agent's run; exceeding it fails the eval
  max_agent_duration: 45s # Working limit for the agent; it is stopped, then whatever it finished is graded
  expect_tools: ["write_file"] # Agent MUST call these tools
  reject_tools: ["delete_file"] # Agent must NOT call these tools
  expect_skills: ["my-skill"] # These skills must be activated

Scoring configuration

When scoring.weights is provided, weights apply per grader type. If a stimulus has multiple graders of the same type (e.g. two output-contains checks), their scores are averaged first and the type weight is applied once: Σ(weight_t × avg_score_t). Weights must form a normalized distribution (sum to 1.0, ±0.01 tolerance); vally lint reports a scoring-weight-sum error when they do not. Grader types absent from the map receive weight 0 and do not contribute to the score.

scoring.threshold, when present, is applied to the aggregate score. You can override it for a single run with vally eval --threshold <number>.

scoring:
  weights:
    file-exists: 0.7 # Weight for each grader type
    output-contains: 0.3
  threshold: 0.7 # Minimum aggregate score to pass
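
As a worked example with these weights, suppose a hypothetical stimulus has one file-exists grader that scores 1.0 and two output-contains graders that score 1.0 and 0.0. The scores are made up for illustration. The two output-contains scores are averaged first, then each type weight is applied once:

```latex
\begin{aligned}
\text{avg}_{\text{output-contains}} &= \frac{1.0 + 0.0}{2} = 0.5 \\
\text{score} &= \sum_t w_t \cdot \text{avg}_t = 0.7 \times 1.0 + 0.3 \times 0.5 = 0.85
\end{aligned}
```

An aggregate of 0.85 clears the 0.7 threshold, even though one of the three graders failed.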

If weights is omitted, all graders are averaged equally. See Scoring for how pass@k and pass^k are computed.

Patterns and tips

Multiple stimuli testing the same capability

stimuli:
  - name: simple-case
    prompt: "Generate tests for function add(a,b)"
    graders: [{ type: file-exists, config: { path: "*.test.js" } }]

  - name: complex-case
    prompt: "Generate tests for an async API client"
    graders: [{ type: file-exists, config: { path: "*.test.js" } }]

  - name: edge-case
    prompt: "Generate tests for a function with no arguments"
    graders: [{ type: file-exists, config: { path: "*.test.js" } }]
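
When several stimuli share an identical grader list, standard YAML anchors and aliases can remove the repetition. The sketch below assumes the eval spec loader resolves aliases, which most YAML parsers do but which this guide does not confirm. The anchor name test-file-graders is made up for illustration.

```yaml
stimuli:
  - name: simple-case
    prompt: "Generate tests for function add(a,b)"
    graders: &test-file-graders # Define the shared grader list once
      - type: file-exists
        config:
          path: "*.test.js"

  - name: complex-case
    prompt: "Generate tests for an async API client"
    graders: *test-file-graders # Reuse the same list

  - name: edge-case
    prompt: "Generate tests for a function with no arguments"
    graders: *test-file-graders
```

If vally lint rejects the aliased form, the explicit version above remains the safe choice.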

Regression testing

Use type: regression to signal that this eval checks for regressions rather than new capabilities:

type: regression

defaults:
  runs: 5 # More trials for statistical confidence

scoring:
  threshold: 0.9 # Higher bar for regressions
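
Combined with a stimulus, a complete regression spec might look like the sketch below. The eval name, stimulus name, prompt, and grader values are placeholders; every key is one already shown elsewhere in this guide.

```yaml
name: my-skill-regression # Placeholder name
type: regression

defaults:
  runs: 5 # More trials for statistical confidence

stimuli:
  - name: generates-test-file # Placeholder stimulus
    prompt: "Generate tests for function add(a,b)"
    graders:
      - type: file-exists
        config:
          path: "*.test.js"

scoring:
  threshold: 0.9 # Higher bar for regressions
```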

Next steps

eval.yaml schema reference — complete field specification
Grader catalog — all available graders
Debugging evals — when evals fail unexpectedly