Writing Eval Specs

A complete reference for writing eval.yaml specifications and task definitions.

eval.yaml Structure

The evaluation spec defines the benchmark configuration, graders, and task files:

name: code-explainer-eval
description: Evaluation suite for code-explainer skill
skill: code-explainer
schemaVersion: "1.0"
version: "1.0"

config:
  trials_per_task: 1
  timeout_seconds: 300
  parallel: false
  executor: mock
  model: claude-sonnet-4.6

metrics:
  - name: accuracy
    weight: 1.0
    threshold: 0.8

graders:
  - type: text
    name: explains_concepts
    config:
      regex_match:
        - "(?i)(function|logic|parameter)"
  - type: code
    name: has_output
    config:
      assertions:
        - "len(output) > 100"

tasks:
  - "tasks/*.yaml"

Top-Level Fields

Field	Type	Required	Description
`name`	string	✓	Eval suite name
`description`	string	✗	What the eval tests
`skill`	string	✓	Associated skill or custom agent name (`SKILL.md` or `.agent.md`)
`schemaVersion`	string	✗	Public schema version for this artifact; defaults to `1.0`
`version`	string	✗	Version number (e.g., “1.0”)
`inputs`	object	✗	Key-value map of global template variables (see Template Variables)
`tasks_from`	string	✗	Path to an external YAML file containing the task list
`hooks`	object	✗	Lifecycle hooks that run shell commands at specific points (see Hooks)
`mcp_mocks`	array	✗	Hermetic MCP server mocks for deterministic tool-call evals; requires `schemaVersion: "1.1"`
`adversarial`	object	✗	Built-in fault-injection packs consumed by `waza adversarial --spec`; requires `schemaVersion: "1.2"`
`baseline`	bool	✗	Mark this spec as a baseline for A/B comparison

name and skill are required; every other top-level field is optional.

schemaVersion uses MAJOR.MINOR format. Omit it for legacy files that should default to 1.0; add it to new evals so future schema migrations are explicit. See Schema Changes for the compatibility policy.
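As a rough illustration of how the version gate could be checked before a run, the sketch below parses a MAJOR.MINOR string and reports top-level fields that need a newer schemaVersion than the spec declares. The minimum versions come from the Top-Level Fields table; the function names and the dictionary are hypothetical and are not part of Waza.

```python
# Illustrative sketch only; this is not Waza's implementation.
# Minimum schemaVersion per optional top-level field, as listed in this guide.
FEATURE_MIN_VERSION = {
    "mcp_mocks": (1, 1),
    "adversarial": (1, 2),
}


def parse_schema_version(value=None):
    """Parse a MAJOR.MINOR string; a missing value defaults to 1.0."""
    if value is None:
        return (1, 0)
    major, minor = str(value).split(".")
    return (int(major), int(minor))


def unsupported_features(spec):
    """Return top-level fields that need a newer schemaVersion than declared."""
    declared = parse_schema_version(spec.get("schemaVersion"))
    return sorted(
        field
        for field, minimum in FEATURE_MIN_VERSION.items()
        if field in spec and declared < minimum
    )
```

Quoting the value in YAML, as the examples in this guide do, keeps a version such as "1.10" from being read as the number 1.1.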

Targeting Custom Agents

Waza supports evaluating VS Code custom agents (.agent.md files) alongside traditional SKILL.md-based skills. When you target an agent with a tools: field in its frontmatter, Waza automatically injects a tool_constraint grader to validate that only the declared tools are called.

Specify an agent:

name: security-agent-eval
description: Evaluate the security-reviewer custom agent
skill: security-reviewer  # Points to security-reviewer.agent.md
schemaVersion: "1.0"
version: "1.0"

config:
  model: claude-sonnet-4.6

Key points:

Use skill: <name> to target either a skill or a custom agent
Waza discovers .agent.md files the same way as SKILL.md — in the current directory or agents/ subdirectories
If both SKILL.md and .agent.md exist in the same directory, SKILL.md takes priority
Custom agents can declare a tools: field in frontmatter, which auto-injects a tool_constraint grader

Learn more: See the Evaluating Custom Agents guide for detailed examples and the auto-injected tool constraint behavior.

Config Section

The config block controls execution behavior:

config:
  trials_per_task: 1 # Run each task this many times
  timeout_seconds: 300 # Task timeout in seconds
  parallel: false # Run tasks sequentially (true = concurrent)
  workers: 0 # Auto-size parallel workers if parallel: true
  model: claude-sonnet-4.6 # Default model (override with --model)
  judge_model: gpt-4o # Model for LLM-as-judge graders (optional)
  executor: mock # mock (local) or copilot-sdk (real API)
  inject_skill_body: true # Inject target SKILL.md/.agent.md body into the prompt
  instruction_files:
    - .github/instructions/project.instructions.md

Field	Type	Default	Description
`trials_per_task`	int	1	Number of times each task runs (for statistical analysis)
`timeout_seconds`	int	300	Task timeout in seconds
`first_event_timeout_seconds`	int	0 (off)	Abort a run that produces no first event within N seconds (session-start hang); 0 disables
`parallel`	bool	false	Run tasks concurrently
`workers`	int	0	Number of parallel workers; 0 auto-sizes
`model`	string	required	Default model for tasks (override with `--model` flag)
`judge_model`	string	(same as `model`)	Model for `prompt`-type graders (LLM-as-judge)
`executor`	string	`copilot-sdk`	Executor: `mock` (local, echoes task metadata and file content) or `copilot-sdk` (real API)
`max_attempts`	int	0	Maximum retry attempts per task on failure (0 = no retries)
`group_by`	string	—	Group results by a field (e.g., `tags`, `task_id`)
`fail_fast`	bool	false	Stop the entire run on first task failure
`skill_directories`	list[str]	`[]`	Additional directories to search for skills
`instruction_files`	list[str]	`[]`	Instruction files to apply to every task
`inject_skill_body`	bool	true	Inject the target `SKILL.md` or `.agent.md` body into the system prompt
`disabled_skills`	list[str]	`[]`	Skills to disable. Use `["*"]` to disable all skills
`required_skills`	list[str]	`[]`	Skills that must be available before running
`mcp_servers`	object	—	MCP server configurations for the evaluation

Trigger-precision evals

Set inject_skill_body: false when the eval is measuring whether the agent invokes a skill rather than whether it can complete the work after already seeing the skill body:

name: xyz-trigger
description: Trigger-precision tasks for the xyz skill
skill: xyz
schemaVersion: "1.0"
version: "1.0"

config:
  trials_per_task: 1
  timeout_seconds: 300
  parallel: false
  executor: copilot-sdk
  model: claude-sonnet-4.6
  inject_skill_body: false

metrics:
  - name: trigger_precision
    weight: 1.0
    threshold: 0.8

tasks:
  - "tasks/trigger/*.yaml"

With this setting, Waza still discovers skills, passes them to the Copilot SDK, and includes the compact <available_skills> summary with skill names and descriptions. It only suppresses the full target <skill_context> body, so behavior graders with required_tools or forbidden_tools and skill_invocation graders can observe whether the skill tool was used. If you also set disabled_skills: ["*"], all skill loading is disabled and this setting has no effect.

Common Timeouts:

60 — Quick tasks (single-file review, validation)
300 — Standard tasks (code explanation, analysis)
600 — Complex tasks (multi-file refactoring, design)

MCP Mock Servers

Use top-level mcp_mocks to replace live MCP dependencies with deterministic local stdio servers. This keeps Copilot SDK evals hermetic in CI: no network listener, no port allocation, and no external service credentials. Because this is an additive eval schema field, set schemaVersion: "1.1" or newer.

name: issue-triage-eval
skill: issue-triage
schemaVersion: "1.1"
version: "1.0"

config:
  executor: copilot-sdk
  model: claude-sonnet-4.6

mcp_mocks:
  - name: github
    tools:
      list_issues:
        description: Return matching issues for a repository
        input_schema:
          type: object
          properties:
            owner: { type: string }
            repo: { type: string }
          required: [owner, repo]
        responses:
          - match:
              owner: microsoft
              repo: waza
            return:
              issues:
                - number: 363
                  title: MCP server mocks for hermetic eval
          - match_regex:
              repo: "^waza-.*"
            return:
              issues: []
          - match_schema:
              type: object
              required: [owner, repo]
            error: "No fixture for this repository"

tasks:
  - tasks/*.yaml

Response matching is evaluated in order. match requires exact full-argument equality, match_schema validates the call arguments against an inline JSON Schema, and match_regex applies regular expressions to individual argument fields. Unknown tools or calls that do not match any response fail loudly with an MCP tool error that points to the missing mock fixture.
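The first-match-wins behavior described above can be pictured with a small sketch. It is not Waza's code: the function names are hypothetical, match_regex is assumed to use search semantics on the string form of each argument, and match_schema is reduced to checking the required keyword rather than performing full JSON Schema validation.

```python
import re


def matches(rule, args):
    """Check one response rule against the tool-call arguments."""
    if "match" in rule:
        # Exact equality over the full argument object.
        return rule["match"] == args
    if "match_regex" in rule:
        # Every listed field must be present and match its pattern.
        return all(
            field in args and re.search(pattern, str(args[field]))
            for field, pattern in rule["match_regex"].items()
        )
    if "match_schema" in rule:
        # Simplification: only `required` is checked, not the full schema.
        return all(key in args for key in rule["match_schema"].get("required", []))
    return False


def resolve(responses, args):
    """Return the payload of the first matching rule, in declaration order."""
    for rule in responses:
        if matches(rule, args):
            if "error" in rule:
                raise RuntimeError(rule["error"])
            return rule["return"]
    # No rule matched: fail loudly instead of returning an empty result.
    raise LookupError(f"no mock fixture matches arguments {args!r}")
```

Because rules are tried in order, a broad match_schema rule placed first would shadow the more specific rules after it, so the catch-all belongs last, as in the example above.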

You can keep larger fixtures in JSON files instead of inline YAML:

mcp_mocks:
  - name: github
    fixtures: fixtures/mcp/github

Each .json fixture can either contain a single tool definition (tool name from the filename) or a { "tools": { ... } } object with multiple tools.
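To make the two fixture layouts concrete, here is a minimal loader sketch. It assumes a flat directory of .json files and treats any file with a top-level tools object as the multi-tool form; the function name is hypothetical and Waza's actual loader may differ.

```python
import json
from pathlib import Path


def load_fixture_tools(fixtures_dir):
    """Collect tool definitions from every .json file in a fixtures directory."""
    tools = {}
    for path in sorted(Path(fixtures_dir).glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        if isinstance(data, dict) and isinstance(data.get("tools"), dict):
            # Multi-tool file: {"tools": {"<tool name>": {...}, ...}}
            tools.update(data["tools"])
        else:
            # Single-tool file: the tool name comes from the filename.
            tools[path.stem] = data
    return tools
```

Under these assumptions, a file named list_issues.json holding one definition and a bundle file holding several tools both end up in the same name-to-definition map.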

Adversarial Packs

Use top-level adversarial to pin the built-in fault-injection packs and unsafe-outcome policy for waza adversarial --spec. This is additive in schemaVersion: "1.2" and is ignored by normal waza run executions.

schemaVersion: "1.2"
adversarial:
  packs:
    - prompt-injection
    - scope-bypass
  on_unsafe_outcome: fail  # or "warn"

packs must name one or more built-in packs. In v0.38.0 those are prompt-injection and scope-bypass. on_unsafe_outcome: fail exits 2 when an unsafe outcome is observed; warn records the unsafe result but exits 0. See the Adversarial Harness guide for pack behavior and CI examples.

Graders Section

Graders validate task outputs. Define once, reuse across tasks:

graders:
  - type: text
    name: checks_logic
    weight: 2.0
    config:
      regex_match:
        - "(?i)(function|variable|parameter)"

  - type: code
    name: has_minimum_output
    config:
      assertions:
        - "len(output) > 100"
        - "'success' in output.lower()"

  - type: text
    name: mentions_key_concepts
    config:
      contains:
        - "algorithm"
        - "optimization"

Each grader accepts an optional weight (default 1.0) that controls its influence on the composite score. See Validators & Graders for details.

All graders return:

score: 0.0 to 1.0
passed: boolean
message: human-readable result

See the Validators & Graders guide for all 12 types and examples.
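One way to picture how weight feeds into a composite score is a weighted mean of the per-grader scores. This guide does not spell out the aggregation rule, so the weighted mean below is an assumption, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GraderResult:
    name: str
    score: float         # 0.0 to 1.0
    passed: bool
    message: str
    weight: float = 1.0  # optional in the spec; defaults to 1.0


def composite_score(results):
    """Weighted mean of grader scores (assumed aggregation rule)."""
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        return 0.0
    return sum(r.score * r.weight for r in results) / total_weight
```

With checks_logic at weight 2.0 scoring 1.0 and an unweighted grader scoring 0.0, this sketch yields two thirds, so the heavier grader dominates the result.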

Mapping OpenAI Evals `modelgraded` YAML

OpenAI Evals modelgraded specs usually collapse into Waza’s prompt grader. The judge prompt carries the label semantics, while Waza handles execution and scoring.

OpenAI Evals field	Waza equivalent	Notes
`prompt`	`graders[].config.prompt`	Put the judging instructions directly in the prompt
`choice_strings`	prompt text	List the labels in the judge prompt; Waza’s prompt grader is binary, so the label choice becomes pass/fail guidance
`choice_scores`	prompt text	Encode the scoring rule in the judge prompt; use `pairwise` mode when the comparison is relative
`input_outputs`	`tasks:` entries	Turn each example into one Waza task with its own `inputs.prompt` and expected checks
`eval_type: cot_classify`	`type: prompt`	Use `mode: independent` for one-shot classification
`battle.yaml`	`type: prompt` + `mode: pairwise`	Closest grader match for head-to-head comparison; `waza compare` is still the better run-level report

Translation examples

`fact.yaml`

OpenAI’s registry uses this pattern for fixed-choice factual classification. In Waza, keep the evaluation as a single prompt grader and turn each input/output row into a task:

graders:
  - type: prompt
    name: fact_check
    config:
      prompt: |
        You are checking a multiple-choice answer.
        Valid choices: A, B, C, D, E.
        Call set_waza_grade_pass only if the model's answer matches the correct choice.
        Otherwise call set_waza_grade_fail with a short reason.
      continue_session: false

tasks:
  - id: fact-001
    name: fact-001
    inputs:
      prompt: "Which answer is correct for the fact pattern?"
    expected:
      output_contains:
        - "B"

`closedqa.yaml`

For closed-domain QA, where the answer is judged against provided context, the judge prompt can encode the score mapping directly:

graders:
  - type: prompt
    name: closedqa_judge
    config:
      prompt: |
        Judge the answer against the reference.
        If the answer is fully correct, call set_waza_grade_pass.
        If it is partially correct or incorrect, call set_waza_grade_fail.
        Treat "Y" as 1.0 and "N" as 0.0 in your reasoning, but only emit pass/fail.
      model: claude-sonnet-4.5

tasks:
  - id: closedqa-001
    name: closedqa-001
    inputs:
      prompt: "Answer the question using the provided context."
    expected:
      output_contains:
        - "Y"

`battle.yaml`

Battle-style comparisons are the one place where the mapping is not 1:1. The nearest Waza translation is a pairwise prompt grader, but the run-level comparison report is usually better expressed with waza compare:

baseline: true

graders:
  - type: prompt
    name: battle_judge
    config:
      mode: pairwise
      prompt: |
        Compare the two answers and decide which one is better.
        Call set_waza_grade_pass if the skill run wins.
        Call set_waza_grade_fail if the baseline run wins.

tasks:
  - id: battle-001
    name: battle-001
    inputs:
      prompt: "Compare these two solutions and pick the better one."

Tasks Section

Tasks define individual test cases. Each task lives in its own YAML file, and the eval spec lists which files to load.

From Files

List task files by glob pattern or by explicit path:

tasks:
  - "tasks/*.yaml" # All YAML files in tasks/
  - "tasks/basic/*.yaml" # Specific subdirectory
  - "tasks/advanced.yaml" # Single file

Task File Format

Individual task files (e.g., tasks/basic-usage.yaml):

id: basic-usage-001
name: Basic Usage - Python Function
description: Test that the skill explains a simple Python function correctly.

tags:
  - basic
  - happy-path

inputs:
  prompt: "Explain this function"
  files:
    - path: sample.py

expected:
  output_contains:
    - "function"
    - "parameter"
    - "return"
  outcomes:
    - type: task_completed
  behavior:
    max_tool_calls: 5

Task Fields

Field	Type	Description
`id`	string	Unique task identifier
`name`	string	Human-readable task name
`description`	string	What the task tests
`tags`	array	Tags for filtering (e.g., `["basic", "edge-case"]`)
`inputs`	object	Test inputs (prompt, files)
`expected`	object	Validation rules and expected behavior
`skill_directories`	string[]	Skill directories for this task (overrides eval-level)
`instruction_files`	string[]	Instruction files for this task (adds to eval-level files)
`golden`	bool	Mark as a critical “golden” task — enforced by `waza gate` (must always pass)

Inputs Section

inputs:
  prompt: "Your instruction to the agent"
  context:
    fixture: fixtures/demo # Optional: copy this fixture file/dir into the workspace
  files:
    - path: sample.py # Fixture file (relative to fixtures dir)
      content: | # Or inline content
        def hello():
            print("Hello")

inputs.context.fixture is resolved relative to the eval spec directory. When it points to a directory, Waza copies that directory’s contents into the fresh task workspace before the agent runs.

Loading prompts from a file

Use prompt_file instead of prompt to load the prompt text from an external file. The path is resolved relative to the task YAML file’s directory.

inputs:
  prompt_file: prompts/review-instructions.md
  files:
    - path: sample.py

This is useful when prompts are long, shared across tasks, or maintained separately. You must specify either prompt or prompt_file, but not both.

Follow-up Prompts

Use follow_up_prompts to send additional messages after the initial prompt. Each follow-up reuses the same session and workspace, so file changes and conversation history persist across turns.

inputs:
  prompt: "Create a Python function that reads a CSV file"
  follow_up_prompts:
    - "Add error handling for missing files"
    - "Write unit tests for the function"

This is useful for evaluating multi-turn conversations where each step builds on the previous one. Graders run only after all prompts (initial + follow-ups) have completed, so the final output reflects the full conversation.

Responder (Interactive Skills)

For skills that ask follow-up questions, configure a responder — an LLM that plays the user and answers the skill’s questions. It is mutually exclusive with follow_up_prompts.

inputs:
  prompt: "Add a new agent to my application"
  responder:
    model: gpt-4o          # optional; defaults to config.model
    instructions: |
      The agent you want is "research-agent" with system instructions
      "Search the web and summarise findings", tools web_search + url_fetch,
      and no handoffs. Answer the skill's questions consistently with this.
      If you genuinely can't infer an answer, abstain.
    max_followups: 8

After each agent turn the responder either replies (the answer is sent back, continuing the conversation), stops (the agent is done), or abstains — which fails the run with a distinct abstained outcome, signalling the brief is too vague. If max_followups is reached while the agent is still asking questions, the loop stops with outcome cap_exhausted and graders evaluate the final state. Each task carries its own responder, so the same skill can be tested against several target configurations.
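The responder loop above can be sketched as a small driver. This is an illustration under stated assumptions, not Waza's implementation: agent and responder are stand-in callables, and the "completed" label for a normal stop is hypothetical (only abstained and cap_exhausted are named in this guide).

```python
def run_with_responder(agent, responder, prompt, max_followups):
    """Drive a conversation until the responder stops, abstains, or the cap is hit.

    agent(message) returns the agent's reply text.
    responder(reply) returns ("reply", text), ("stop", None), or ("abstain", None).
    Returns (outcome, final_reply).
    """
    reply = agent(prompt)
    followups = 0
    while True:
        action, text = responder(reply)
        if action == "stop":
            return "completed", reply      # hypothetical label for a normal finish
        if action == "abstain":
            return "abstained", reply      # brief too vague to answer
        if followups >= max_followups:
            return "cap_exhausted", reply  # agent still asking after the cap
        followups += 1
        reply = agent(text)
```

In this sketch the cap is checked only when the responder wants to answer again, so a conversation that finishes exactly at the cap still counts as a normal stop.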

Per-Turn Checkpoints

By default, graders run only once, against the final state of a multi-turn conversation. For long conversations, you can also run graders at specific turn boundaries with a top-level checkpoints: list:

checkpoints:
  - after_turn: 1
    graders:
      - type: text
        contains: ["analyzing", "files"]
  - after_turn: 2
    on_failure: stop # abort the run if this checkpoint fails
    graders:
      - type: tool_calls
        required: ["read_file"]

Each checkpoint accepts:

Field	Type	Description
`after_turn`	int	1-based turn number this checkpoint runs after (initial prompt is turn 1).
`graders`	array	Inline graders, same schema as the task-level `graders:` / eval-level `graders:` field.
`on_failure`	string	`continue` (default) or `stop` — abort remaining turns when this checkpoint fails.

Outcomes are recorded per checkpoint in results.json under checkpoints[], alongside the final validations. waza gate still uses final-pass status. Checkpoints are available with schemaVersion: "1.1" and above (additive; older 1.0 files load unchanged).

The prompt field supports templating:

inputs:
  prompt: |
    Explain this code:
    {{fixture:sample.py}}

Expected Section

expected:
  # Strings that must appear in output
  output_contains:
    - "function"
    - "parameter"

  # Output must NOT contain these
  output_not_contains:
    - "error"
    - "failed"

  # At least one of these must appear (flexible matching)
  output_contains_any:
    - "recursion"
    - "iteration"
    - "loop"

  # Task outcomes
  outcomes:
    - type: task_completed
    - type: tool_called
      tool_name: code_analyzer

  # Behavioral constraints
  behavior:
    max_tool_calls: 5
    max_tokens: 4096

output_contains vs output_contains_any

output_contains — ALL listed strings must appear (AND logic). Use for required content.
output_contains_any — At least ONE listed string must appear (OR logic). Use when the agent may express concepts in different ways.

All checks are case-insensitive.
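The AND and OR semantics can be summarized in a few lines. The sketch below is illustrative rather than Waza's implementation; the function and parameter names are hypothetical, and it assumes plain substring matching.

```python
def check_output(output, contains=(), contains_any=(), not_contains=()):
    """Apply the three output checks with case-insensitive substring matching."""
    text = output.lower()
    # output_contains: every string must appear (AND).
    all_present = all(s.lower() in text for s in contains)
    # output_contains_any: at least one must appear (OR); an empty list passes.
    any_present = not contains_any or any(s.lower() in text for s in contains_any)
    # output_not_contains: none of the strings may appear.
    none_forbidden = not any(s.lower() in text for s in not_contains)
    return all_present and any_present and none_forbidden
```

For example, an output that mentions "loop" satisfies output_contains_any: [recursion, iteration, loop], but would fail if the same three strings were listed under output_contains.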

Fixture Isolation

Fixtures are test files (code, documents, data) that tasks reference.

Important: Each task gets a fresh temp workspace with fixtures copied in. Original fixtures are never modified.

Using Fixtures

Create a fixtures/ directory:

evals/code-explainer/
├── eval.yaml
├── tasks/
│   └── basic-usage.yaml
└── fixtures/
    ├── sample.py
    ├── complex.py
    └── README.md

Reference in tasks:

inputs:
  prompt: "Analyze {{fixture:sample.py}}"
  files:
    - path: sample.py

Instruction Files

Use instruction_files to supply repository-wide or task-specific *.instructions.md guidance. Set it under config in eval.yaml, at the top level of a task file, or both:

# eval.yaml: applies to every task
config:
  instruction_files:
    - .github/instructions/project.instructions.md

# Task file: added on top of the eval-level files for this task
instruction_files:
  - .github/instructions/review.instructions.md
inputs:
  prompt: "Review this change"
  files:
    - path: sample.py

Instruction files are resolved from the active fixtures/context directory, copied into each fresh temp workspace, and appended to the agent system message with path labels. Eval-level files apply to every task; task-level files are added for that task. Paths must be relative and cannot use directory traversal.

Directory Structure

# Project mode
evals/
└── code-explainer/
    ├── eval.yaml
    ├── tasks/
    │   ├── basic-usage.yaml
    │   ├── edge-case.yaml
    │   └── should-not-trigger.yaml
    └── fixtures/
        ├── sample.py
        ├── complex.py
        └── nested/
            └── module.py

Specify the context directory when running:

waza run eval.yaml --context-dir evals/code-explainer/fixtures

Alternatively, use relative paths in eval.yaml when the fixtures directory sits next to it.

Multi-Model Comparison

Run the same eval against multiple models:

# Run with gpt-4o
waza run eval.yaml --model gpt-4o -o gpt4.json

# Run with Claude
waza run eval.yaml --model claude-sonnet-4.6 -o sonnet.json

# Compare results
waza compare gpt4.json sonnet.json

Override the default model in eval.yaml:

waza run eval.yaml --model gpt-4o  # Overrides config.model

Filtering and Parallel Execution

Filter by Task Name

waza run eval.yaml --task "basic*" --task "edge*"

Filter by Tags

waza run eval.yaml --tags "happy-path"

Parallel Execution

# Run tasks concurrently with auto-sized workers
waza run eval.yaml --parallel

Saving Results

Save eval results for later analysis or comparison:

waza run eval.yaml -o results.json

Output format:

{
  "name": "code-explainer-eval",
  "model": "claude-sonnet-4.6",
  "pass_rate": 0.8,
  "tasks": [
    {
      "id": "basic-001",
      "name": "Basic Usage",
      "passed": true,
      "graders": [
        {
          "name": "checks_logic",
          "passed": true,
          "score": 1.0
        }
      ]
    }
  ]
}

Caching

For iterative testing, cache results:

waza run eval.yaml --cache --cache-dir .waza-cache

Only tasks with changed inputs/config re-run.
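This guide does not describe how cache entries are keyed. One plausible scheme, shown purely as an assumption, hashes each task's definition together with the config so that any change produces a new key; the function names and the in-memory cache set are hypothetical.

```python
import hashlib
import json


def cache_key(task, config):
    """Stable hash of a task definition plus the config it runs under."""
    payload = json.dumps({"task": task, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def tasks_to_rerun(tasks, config, cache):
    """Return the ids of tasks whose key is not already in the cache."""
    return [t["id"] for t in tasks if cache_key(t, config) not in cache]
```

Under this scheme, editing one task's prompt invalidates only that task, while changing a config value such as the model invalidates every task.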

Common Patterns

Simple Validation

graders:
  - type: text
    name: format_check
    config:
      regex_match:
        - "^[A-Z].*\\.$" # Sentence starting with capital, ending with period

tasks:
  - "tasks/format/*.yaml"

Multi-Criteria Scoring

graders:
  - type: code
    name: completeness
    config:
      assertions:
        - "len(output) > 500"
        - "'function' in output"
        - "'parameter' in output"

tasks:
  - "tasks/completeness/*.yaml"

Behavioral Constraints

Behavioral constraints are defined in individual task YAML files:

id: efficient-001
name: Efficiency test
inputs:
  prompt: "Refactor this code"
expected:
  behavior:
    max_tool_calls: 3 # Efficient
    max_tokens: 1000 # Concise
    max_response_time_ms: 30000 # Must complete within 30 seconds
    required_tools: # Must use these tools
      - grep
      - edit
    forbidden_tools: # Must NOT use these tools
      - rm

Field	Type	Description
`max_tool_calls`	int	Maximum number of tool invocations allowed
`max_iterations`	int	Maximum number of conversation rounds (turns)
`max_tokens`	int	Maximum tokens in the response
`max_response_time_ms`	int	Maximum wall-clock execution time in milliseconds
`required_tools`	string[]	Tools the agent must use during the task
`forbidden_tools`	string[]	Tools the agent must NOT use during the task

Each constraint that is set (non-zero / non-empty) contributes equally to the behavior efficiency score. If all constraints pass, the score is 1.0; each failure reduces it proportionally.
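The proportional scoring rule can be sketched as the fraction of configured constraints that pass. This is a reading of the description above, not Waza's code: the observed field names (tool_calls, tokens, tools_used, and so on) are hypothetical, and returning 1.0 when no constraint is set is an assumption.

```python
# Maps each numeric limit to the (hypothetical) observed metric it bounds.
NUMERIC_LIMITS = {
    "max_tool_calls": "tool_calls",
    "max_iterations": "iterations",
    "max_tokens": "tokens",
    "max_response_time_ms": "response_time_ms",
}


def behavior_score(limits, observed):
    """Fraction of configured behavior constraints that pass."""
    checks = []
    for limit_key, observed_key in NUMERIC_LIMITS.items():
        if limits.get(limit_key):  # zero or missing means "not set"
            checks.append(observed[observed_key] <= limits[limit_key])
    used = set(observed.get("tools_used", []))
    if limits.get("required_tools"):
        checks.append(set(limits["required_tools"]) <= used)
    if limits.get("forbidden_tools"):
        checks.append(not (set(limits["forbidden_tools"]) & used))
    if not checks:
        return 1.0  # assumption: nothing configured, nothing to fail
    return sum(checks) / len(checks)
```

Applied to the efficient-001 task above, a run that stays within every limit except max_tokens passes four of five constraints and scores 0.8.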

Hooks

Lifecycle hooks run shell commands at specific points during an evaluation. Use them for setup, teardown, or validation.

hooks:
  before_run:
    - command: "npm install"
      working_directory: "./fixtures"
      error_on_fail: true
  after_run:
    - command: "bash cleanup.sh"
  before_task:
    - command: "echo Starting task"
  after_task:
    - command: "bash collect-metrics.sh"

Hook	When it runs
`before_run`	Once, before the entire evaluation starts
`after_run`	Once, after all tasks complete
`before_task`	Before each individual task
`after_task`	After each individual task

Each hook entry:

Field	Type	Default	Description
`command`	string	(required)	Shell command to execute
`working_directory`	string	`.`	Working directory for the command
`exit_codes`	list[int]	`[0]`	Acceptable exit codes
`error_on_fail`	bool	false	Abort the run if this hook fails

Template Variables

Use the inputs field to define global template variables that are substituted into task prompts:

inputs:
  language: python
  framework: fastapi

tasks:
  - "tasks/scaffold/*.yaml"

Prompt templating also supports fixture file injection:

inputs:
  prompt: |
    Explain this code:
    {{fixture:sample.py}}

The {{fixture:filename}} syntax inlines the content of a file from the fixtures directory into the prompt.
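Conceptually, fixture injection is a textual substitution. The sketch below shows one way to do it with a regular expression; it is illustrative only, the function name is hypothetical, and it assumes the filename is a path relative to the fixtures directory.

```python
import re
from pathlib import Path

# Matches {{fixture:<relative path>}} and captures the path.
FIXTURE_TAG = re.compile(r"\{\{fixture:([^}]+)\}\}")


def render_prompt(prompt, fixtures_dir):
    """Replace each {{fixture:name}} tag with that file's content."""
    def inline(match):
        path = Path(fixtures_dir) / match.group(1).strip()
        return path.read_text(encoding="utf-8")

    return FIXTURE_TAG.sub(inline, prompt)
```

A missing fixture raises FileNotFoundError in this sketch, which surfaces a bad reference before the prompt is sent.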

External Task Lists

Use tasks_from to load task definitions from a separate YAML file:

name: shared-eval
tasks_from: shared-tasks.yaml

config:
  trials_per_task: 3
  model: claude-sonnet-4.6

This is useful when multiple eval specs share the same task set but differ in config or graders.

Best Practices

Clear task descriptions — Future reviewers should understand what’s being tested
Realistic validators — Don’t over-specify. A few key checks beat 20 strict rules
Fixture diversity — Include basic, edge case, and negative test fixtures
Tag your tasks — Makes filtering and analysis easier
Use timeout appropriately — Too short = false failures, too long = slow tests
Reuse graders — Define once, apply across multiple tasks
Version your evals — Track improvements with version numbers

Next Steps

Validators & Graders — Reference for all grader types
Web Dashboard — Explore results interactively
CLI Reference — All commands and flags