Writing Eval Specs
A complete reference for writing eval.yaml specifications and task definitions.
eval.yaml Structure
The evaluation spec defines the benchmark configuration, graders, and task files:

```yaml
name: code-explainer-eval
description: Evaluation suite for code-explainer skill
skill: code-explainer
version: "1.0"

config:
  trials_per_task: 1
  timeout_seconds: 300
  parallel: false
  executor: mock
  model: claude-sonnet-4.6

metrics:
  - name: accuracy
    weight: 1.0
    threshold: 0.8

graders:
  - type: text
    name: explains_concepts
    config:
      regex_match:
        - "(?i)(function|logic|parameter)"
  - type: code
    name: has_output
    config:
      assertions:
        - "len(output) > 100"

tasks:
  - "tasks/*.yaml"
```

Top-Level Fields

| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | ✓ | Eval suite name |
| `description` | string | ✗ | What the eval tests |
| `skill` | string | ✓ | Associated skill or custom agent name (`SKILL.md` or `.agent.md`) |
| `version` | string | ✗ | Version number (e.g., `"1.0"`) |
| `inputs` | object | ✗ | Key-value map of global template variables (see Template Variables) |
| `tasks_from` | string | ✗ | Path to an external YAML file containing the task list |
| `hooks` | object | ✗ | Lifecycle hooks that run shell commands at specific points (see Hooks) |
| `baseline` | bool | ✗ | Mark this spec as a baseline for A/B comparison |
skill is required.
Targeting Custom Agents
Waza supports evaluating VS Code custom agents (`.agent.md` files) alongside traditional `SKILL.md`-based skills. When you target an agent with a `tools:` field in its frontmatter, Waza automatically injects a `tool_constraint` grader to validate that only the declared tools are called.

Specify an agent:

```yaml
name: security-agent-eval
description: Evaluate the security-reviewer custom agent
skill: security-reviewer  # Points to security-reviewer.agent.md
version: "1.0"

config:
  model: claude-sonnet-4.6
```

Key differences:

- Use `skill: <name>` to target either a skill or a custom agent
- Waza discovers `.agent.md` files the same way as `SKILL.md`: in the current directory or `agents/` subdirectories
- If both `SKILL.md` and `.agent.md` exist in the same directory, `SKILL.md` takes priority
- Custom agents can declare a `tools:` field in frontmatter, which auto-injects a `tool_constraint` grader
Learn more: See the Evaluating Custom Agents guide for detailed examples and the auto-injected tool constraint behavior.
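The tool constraint check described above can be pictured as a comparison between the tools an agent declares and the tools it actually calls. The following is a minimal sketch of that idea only; the function name `undeclared_tool_calls` and the list-based inputs are hypothetical and are not Waza's implementation or API.

```python
def undeclared_tool_calls(declared_tools, called_tools):
    """Return called tools missing from the declared list, in first-seen order.

    Hypothetical helper: illustrates the idea of a tool constraint check,
    not Waza's actual grader.
    """
    allowed = set(declared_tools)
    seen = set()
    violations = []
    for tool in called_tools:
        if tool not in allowed and tool not in seen:
            seen.add(tool)
            violations.append(tool)
    return violations


declared = ["grep", "view"]               # e.g. the agent's tools: frontmatter
calls = ["grep", "edit", "view", "edit"]  # tools observed during a run
print(undeclared_tool_calls(declared, calls))  # ['edit']
```

Under this reading, an empty result means the run stayed within the declared tools.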
Config Section
The config block controls execution behavior:

```yaml
config:
  trials_per_task: 1          # Run each task this many times
  timeout_seconds: 300        # Task timeout in seconds
  parallel: false             # Run tasks sequentially (true = concurrent)
  workers: 4                  # Parallel workers if parallel: true
  model: claude-sonnet-4.6    # Default model (override with --model)
  judge_model: gpt-4o         # Model for LLM-as-judge graders (optional)
  executor: mock              # mock (local) or copilot-sdk (real API)
  instruction_files:
    - .github/instructions/project.instructions.md
```

| Field | Type | Default | Description |
|---|---|---|---|
| `trials_per_task` | int | 1 | Number of times each task runs (for statistical analysis) |
| `timeout_seconds` | int | 300 | Task timeout in seconds |
| `parallel` | bool | false | Run tasks concurrently |
| `workers` | int | 4 | Number of parallel workers |
| `model` | string | required | Default model for tasks (override with `--model` flag) |
| `judge_model` | string | (same as `model`) | Model for prompt-type graders (LLM-as-judge) |
| `executor` | string | copilot-sdk | Executor: `mock` (local, echoes task metadata and file content) or `copilot-sdk` (real API) |
| `max_attempts` | int | 0 | Maximum retry attempts per task on failure (0 = no retries) |
| `group_by` | string | (none) | Group results by a field (e.g., `tags`, `task_id`) |
| `fail_fast` | bool | false | Stop the entire run on first task failure |
| `skill_directories` | list[str] | [] | Additional directories to search for skills |
| `instruction_files` | list[str] | [] | Instruction files to apply to every task |
| `disabled_skills` | list[str] | [] | Skills to disable. Use `["*"]` to disable all skills |
| `required_skills` | list[str] | [] | Skills that must be available before running |
| `mcp_servers` | object | (none) | MCP server configurations for the evaluation |

Common Timeouts:

- `60`: Quick tasks (single-file review, validation)
- `300`: Standard tasks (code explanation, analysis)
- `600`: Complex tasks (multi-file refactoring, design)
Graders Section
Graders validate task outputs. Define once, reuse across tasks:

```yaml
graders:
  - type: text
    name: checks_logic
    weight: 2.0
    config:
      regex_match:
        - "(?i)(function|variable|parameter)"

  - type: code
    name: has_minimum_output
    config:
      assertions:
        - "len(output) > 100"
        - "'success' in output.lower()"

  - type: text
    name: mentions_key_concepts
    config:
      contains:
        - "algorithm"
        - "optimization"
```

Each grader accepts an optional `weight` (default `1.0`) that controls its influence on the composite score. See Validators & Graders for details.

All graders return:

- `score`: 0.0 to 1.0
- `passed`: boolean
- `message`: human-readable result
See the Validators & Graders guide for all 12 types and examples.
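To make the effect of `weight` concrete, here is a small sketch that combines grader scores as a weighted mean. This page only says that weights control a grader's influence on the composite score, so the weighted-mean formula is an assumption, and `composite_score` is a hypothetical name rather than part of Waza.

```python
def composite_score(results):
    """Weighted mean of grader scores.

    `results` is a list of (score, weight) pairs. The weighted-mean rule is an
    assumption for illustration; the actual aggregation may differ.
    """
    total_weight = sum(weight for _, weight in results)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in results) / total_weight


# checks_logic passes with weight 2.0; two default-weight graders split 0.0 / 1.0.
print(composite_score([(1.0, 2.0), (0.0, 1.0), (1.0, 1.0)]))  # 0.75
```

With equal weights the same three results would average to about 0.67, so doubling a passing grader's weight pulls the composite up.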
Mapping OpenAI Evals modelgraded YAML
OpenAI Evals modelgraded specs usually collapse into Waza’s prompt grader. The judge prompt carries the label semantics, while Waza handles execution and scoring.

| OpenAI Evals field | Waza equivalent | Notes |
|---|---|---|
| `prompt` | `graders[].config.prompt` | Put the judging instructions directly in the prompt |
| `choice_strings` | prompt text | List the labels in the judge prompt; Waza’s prompt grader is binary, so the label choice becomes pass/fail guidance |
| `choice_scores` | prompt text | Encode the scoring rule in the judge prompt; use pairwise mode when the comparison is relative |
| `input_outputs` | `tasks:` entries | Turn each example into one Waza task with its own `inputs.prompt` and expected checks |
| `eval_type: cot_classify` | `type: prompt` | Use `mode: independent` for one-shot classification |
| `battle.yaml` | `type: prompt` + `mode: pairwise` | Closest grader match for head-to-head comparison; `waza compare` is still the better run-level report |
Translation examples
fact.yaml

OpenAI’s registry uses this pattern for fixed-choice factual classification. In Waza, keep the evaluation as a single prompt grader and turn each input/output row into a task:

```yaml
graders:
  - type: prompt
    name: fact_check
    config:
      prompt: |
        You are checking a multiple-choice answer.
        Valid choices: A, B, C, D, E.
        Call set_waza_grade_pass only if the model's answer matches the correct choice.
        Otherwise call set_waza_grade_fail with a short reason.
      continue_session: false

tasks:
  - id: fact-001
    name: fact-001
    inputs:
      prompt: "Which answer is correct for the fact pattern?"
    expected:
      output_contains:
        - "B"
```

closedqa.yaml
For closed-book QA, the judge prompt can encode the score mapping directly:

```yaml
graders:
  - type: prompt
    name: closedqa_judge
    config:
      prompt: |
        Judge the answer against the reference.
        If the answer is fully correct, call set_waza_grade_pass.
        If it is partially correct or incorrect, call set_waza_grade_fail.
        Treat "Y" as 1.0 and "N" as 0.0 in your reasoning, but only emit pass/fail.
      model: claude-sonnet-4.5

tasks:
  - id: closedqa-001
    name: closedqa-001
    inputs:
      prompt: "Answer the question using the provided context."
    expected:
      output_contains:
        - "Y"
```

battle.yaml
Battle-style comparisons are the one place where the mapping is not 1:1. The nearest Waza translation is a pairwise prompt grader, but the run-level comparison report is usually better expressed with `waza compare`:

```yaml
config:
  baseline: true

graders:
  - type: prompt
    name: battle_judge
    config:
      mode: pairwise
      prompt: |
        Compare the two answers and decide which one is better.
        Call set_waza_grade_pass if the skill run wins.
        Call set_waza_grade_fail if the baseline run wins.

tasks:
  - id: battle-001
    name: battle-001
    inputs:
      prompt: "Compare these two solutions and pick the better one."
```

Tasks Section

Tasks define individual test cases loaded from YAML files:
From Files
Load tasks from YAML files in a directory:

```yaml
tasks:
  - "tasks/*.yaml"          # All YAML files in tasks/
  - "tasks/basic/*.yaml"    # Specific subdirectory
  - "tasks/advanced.yaml"   # Single file
```

Task File Format

Individual task files (e.g., `tasks/basic-usage.yaml`):

```yaml
id: basic-usage-001
name: Basic Usage - Python Function
description: Test that the skill explains a simple Python function correctly.

tags:
  - basic
  - happy-path

inputs:
  prompt: "Explain this function"
  files:
    - path: sample.py

expected:
  output_contains:
    - "function"
    - "parameter"
    - "return"
  outcomes:
    - type: task_completed
  behavior:
    max_tool_calls: 5
```

Task Fields

| Field | Type | Description |
|---|---|---|
| `id` | string | Unique task identifier |
| `name` | string | Human-readable task name |
| `description` | string | What the task tests |
| `tags` | array | Tags for filtering (e.g., `["basic", "edge-case"]`) |
| `inputs` | object | Test inputs (prompt, files) |
| `expected` | object | Validation rules and expected behavior |
| `skill_directories` | string[] | Skill directories for this task (overrides eval-level) |
| `instruction_files` | string[] | Instruction files for this task (adds to eval-level files) |
Inputs Section
```yaml
inputs:
  prompt: "Your instruction to the agent"
  files:
    - path: sample.py     # Fixture file (relative to fixtures dir)
      content: |          # Or inline content
        def hello():
            print("Hello")
```

Loading prompts from a file

Use `prompt_file` instead of `prompt` to load the prompt text from an external file. The path is resolved relative to the task YAML file’s directory.

```yaml
inputs:
  prompt_file: prompts/review-instructions.md
  files:
    - path: sample.py
```

This is useful when prompts are long, shared across tasks, or maintained separately.
You must specify either prompt or prompt_file, but not both.
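The two rules above (exactly one of `prompt` or `prompt_file`, and resolution relative to the task file's directory) can be sketched as follows. This is an illustration under stated assumptions: `resolve_prompt` is a hypothetical helper, the task's `inputs` block is assumed to be already parsed into a dict, and the error message is invented for the example.

```python
import tempfile
from pathlib import Path


def resolve_prompt(inputs, task_file):
    """Return the prompt text for a task's parsed `inputs` mapping.

    Hypothetical helper: enforces "prompt or prompt_file, but not both" and
    resolves prompt_file relative to the task YAML file's directory.
    """
    has_prompt = "prompt" in inputs
    has_file = "prompt_file" in inputs
    if has_prompt == has_file:  # both present, or neither
        raise ValueError("specify exactly one of 'prompt' or 'prompt_file'")
    if has_prompt:
        return inputs["prompt"]
    prompt_path = Path(task_file).parent / inputs["prompt_file"]
    return prompt_path.read_text(encoding="utf-8")


# Demo: a task file that sits next to a prompts/ directory.
with tempfile.TemporaryDirectory() as tmp:
    task_file = Path(tmp) / "basic-usage.yaml"
    prompt_path = Path(tmp) / "prompts" / "review-instructions.md"
    prompt_path.parent.mkdir()
    prompt_path.write_text("Review this change.", encoding="utf-8")
    loaded = resolve_prompt({"prompt_file": "prompts/review-instructions.md"}, task_file)

print(loaded)  # Review this change.
```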
Follow-up Prompts
Use `follow_up_prompts` to send additional messages after the initial prompt. Each follow-up reuses the same session and workspace, so file changes and conversation history persist across turns.

```yaml
inputs:
  prompt: "Create a Python function that reads a CSV file"
  follow_up_prompts:
    - "Add error handling for missing files"
    - "Write unit tests for the function"
```

This is useful for evaluating multi-turn conversations where each step builds on the previous one. Graders run only after all prompts (initial + follow-ups) have completed, so the final output reflects the full conversation.

Prompt supports templating:

```yaml
inputs:
  prompt: |
    Explain this code:
    {{fixture:sample.py}}
```

Expected Section
```yaml
expected:
  # Strings that must appear in output
  output_contains:
    - "function"
    - "parameter"

  # Output must NOT contain these
  output_not_contains:
    - "error"
    - "failed"

  # At least one of these must appear (flexible matching)
  output_contains_any:
    - "recursion"
    - "iteration"
    - "loop"

  # Task outcomes
  outcomes:
    - type: task_completed
    - type: tool_called
      tool_name: code_analyzer

  # Behavioral constraints
  behavior:
    max_tool_calls: 5
    max_tokens: 4096
```

output_contains vs output_contains_any

- `output_contains`: ALL listed strings must appear (AND logic). Use for required content.
- `output_contains_any`: At least ONE listed string must appear (OR logic). Use when the agent may express concepts in different ways.
All checks are case-insensitive.
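The AND/OR distinction and the case-insensitive matching can be summarized in a few lines. This is a sketch of the semantics described above, not Waza's code; `check_expected` and its keyword names are hypothetical, and treating an empty `output_contains_any` list as passing is an assumption.

```python
def check_expected(output, contains=(), contains_any=(), not_contains=()):
    """Evaluate substring expectations against an output, ignoring case.

    Hypothetical helper mirroring output_contains (AND),
    output_contains_any (OR), and output_not_contains.
    """
    text = output.lower()
    all_present = all(item.lower() in text for item in contains)
    # Assumption: an empty output_contains_any list imposes no requirement.
    any_present = not contains_any or any(item.lower() in text for item in contains_any)
    none_forbidden = not any(item.lower() in text for item in not_contains)
    return all_present and any_present and none_forbidden


output = "This Function takes one parameter and uses a LOOP."
print(check_expected(output, contains=["function", "parameter"]))       # True
print(check_expected(output, contains_any=["recursion", "loop"]))       # True
print(check_expected(output, contains=["function", "return"]))          # False
print(check_expected(output, not_contains=["error", "failed"]))         # True
```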
Fixture Isolation
Fixtures are test files (code, documents, data) that tasks reference.
Important: Each task gets a fresh temp workspace with fixtures copied in. Original fixtures are never modified.
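The isolation guarantee can be illustrated with a copy-into-temp-directory pattern. This is a minimal sketch of the idea only; `make_workspace` is a hypothetical helper, and how Waza actually creates and cleans up workspaces is not specified on this page.

```python
import shutil
import tempfile
from pathlib import Path


def make_workspace(fixtures_dir):
    """Copy fixtures into a fresh temp directory and return its path.

    Hypothetical helper: the task works on the copy, so the original
    fixtures are never touched.
    """
    workspace = Path(tempfile.mkdtemp(prefix="task-workspace-"))
    shutil.copytree(fixtures_dir, workspace, dirs_exist_ok=True)
    return workspace


# Demo: edits in the workspace do not reach the original fixture.
fixtures = Path(tempfile.mkdtemp(prefix="fixtures-"))
(fixtures / "sample.py").write_text("print('hello')\n", encoding="utf-8")

workspace = make_workspace(fixtures)
(workspace / "sample.py").write_text("print('changed')\n", encoding="utf-8")

original_text = (fixtures / "sample.py").read_text(encoding="utf-8")
workspace_text = (workspace / "sample.py").read_text(encoding="utf-8")
print(original_text == workspace_text)  # False

# Clean up the demo directories.
shutil.rmtree(workspace)
shutil.rmtree(fixtures)
```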
Using Fixtures
Create a `fixtures/` directory:

```
evals/code-explainer/
├── eval.yaml
├── tasks/
│   └── basic-usage.yaml
└── fixtures/
    ├── sample.py
    ├── complex.py
    └── README.md
```

Reference in tasks:

```yaml
inputs:
  prompt: "Analyze {{fixture:sample.py}}"
  files:
    - path: sample.py
```

Instruction Files

Use `instruction_files` for repository or task-specific `*.instructions.md` guidance:

```yaml
# Eval level (eval.yaml)
config:
  instruction_files:
    - .github/instructions/project.instructions.md
```

```yaml
# Task level (task file)
instruction_files:
  - .github/instructions/review.instructions.md
inputs:
  prompt: "Review this change"
  files:
    - path: sample.py
```

Instruction files are resolved from the active fixtures/context directory, copied into each fresh temp workspace, and appended to the agent system message with path labels. Eval-level files apply to every task; task-level files are added for that task. Paths must be relative and cannot use directory traversal.
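The path rule for instruction files (relative, no directory traversal) can be expressed as a short check. This is a sketch under assumptions: `is_safe_instruction_path` is a hypothetical name, paths are treated as POSIX-style strings, and Waza's own validation may apply additional rules.

```python
from pathlib import PurePosixPath


def is_safe_instruction_path(raw):
    """Return True for a non-empty relative path with no '..' component.

    Hypothetical check illustrating "relative and no directory traversal".
    """
    path = PurePosixPath(raw)
    return bool(raw) and not path.is_absolute() and ".." not in path.parts


print(is_safe_instruction_path(".github/instructions/project.instructions.md"))  # True
print(is_safe_instruction_path("../secrets.instructions.md"))                    # False
print(is_safe_instruction_path("/etc/project.instructions.md"))                  # False
```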
Directory Structure
```
# Project mode
evals/
└── code-explainer/
    ├── eval.yaml
    ├── tasks/
    │   ├── basic-usage.yaml
    │   ├── edge-case.yaml
    │   └── should-not-trigger.yaml
    └── fixtures/
        ├── sample.py
        ├── complex.py
        └── nested/
            └── module.py
```

Specify context directory when running:

```bash
waza run eval.yaml --context-dir evals/code-explainer/fixtures
```

Or use relative paths in eval.yaml if fixtures are adjacent.
Multi-Model Comparison
Run the same eval against multiple models:

```bash
# Run with gpt-4o
waza run eval.yaml --model gpt-4o -o gpt4.json

# Run with Claude
waza run eval.yaml --model claude-sonnet-4.6 -o sonnet.json

# Compare results
waza compare gpt4.json sonnet.json
```

Override the default model in eval.yaml:

```bash
waza run eval.yaml --model gpt-4o  # Overrides config.model
```

Filtering and Parallel Execution

Filter by Task Name

```bash
waza run eval.yaml --task "basic*" --task "edge*"
```

Filter by Tags

```bash
waza run eval.yaml --tags "happy-path"
```

Parallel Execution

```bash
# Run tasks concurrently with 4 workers
waza run eval.yaml --parallel --workers 4
```

Saving Results
Save eval results for later analysis or comparison:

```bash
waza run eval.yaml -o results.json
```

Output format:

```json
{
  "name": "code-explainer-eval",
  "model": "claude-sonnet-4.6",
  "pass_rate": 0.8,
  "tasks": [
    {
      "id": "basic-001",
      "name": "Basic Usage",
      "passed": true,
      "graders": [
        {
          "name": "checks_logic",
          "passed": true,
          "score": 1.0
        }
      ]
    }
  ]
}
```

Caching

For iterative testing, cache results:

```bash
waza run eval.yaml --cache --cache-dir .waza-cache
```

Only tasks with changed inputs/config re-run.
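One common way to get "re-run only what changed" behavior is to key cached results on a hash of the task inputs and config. The sketch below shows that general technique; it is an assumption about how such a cache could work, not a description of Waza's cache format, and `cache_key` is a hypothetical name.

```python
import hashlib
import json


def cache_key(task_inputs, config):
    """Derive a stable key from task inputs and config.

    Hypothetical scheme: serialize with sorted keys so that key order does
    not matter, then hash. Any change to inputs or config yields a new key.
    """
    payload = json.dumps({"inputs": task_inputs, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


config = {"model": "claude-sonnet-4.6", "timeout_seconds": 300}
first = cache_key({"prompt": "Explain this function"}, config)
same = cache_key({"prompt": "Explain this function"},
                 {"timeout_seconds": 300, "model": "claude-sonnet-4.6"})
changed = cache_key({"prompt": "Explain this class"}, config)

print(first == same)     # True: same content, different key order
print(first == changed)  # False: the prompt changed, so the task would re-run
```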
Common Patterns
Simple Validation

```yaml
graders:
  - type: text
    name: format_check
    config:
      regex_match:
        - "^[A-Z].*\\.$"  # Sentence starting with capital, ending with period

tasks:
  - "tasks/format/*.yaml"
```

Multi-Criteria Scoring

```yaml
graders:
  - type: code
    name: completeness
    config:
      assertions:
        - "len(output) > 500"
        - "'function' in output"
        - "'parameter' in output"

tasks:
  - "tasks/completeness/*.yaml"
```

Behavioral Constraints

Behavioral constraints are defined in individual task YAML files:

```yaml
id: efficient-001
name: Efficiency test
inputs:
  prompt: "Refactor this code"
expected:
  behavior:
    max_tool_calls: 3            # Efficient
    max_tokens: 1000             # Concise
    max_response_time_ms: 30000  # Must complete within 30 seconds
    required_tools:              # Must use these tools
      - grep
      - edit
    forbidden_tools:             # Must NOT use these tools
      - rm
```

| Field | Type | Description |
|---|---|---|
| `max_tool_calls` | int | Maximum number of tool invocations allowed |
| `max_iterations` | int | Maximum number of conversation rounds (turns) |
| `max_tokens` | int | Maximum tokens in the response |
| `max_response_time_ms` | int | Maximum wall-clock execution time in milliseconds |
| `required_tools` | string[] | Tools the agent must use during the task |
| `forbidden_tools` | string[] | Tools the agent must NOT use during the task |
Each constraint that is set (non-zero / non-empty) contributes equally to the behavior efficiency score. If all constraints pass, the score is 1.0; each failure reduces it proportionally.
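The scoring rule above amounts to the fraction of set constraints that pass. The sketch below works through that arithmetic for a subset of the fields (`max_iterations` and `max_response_time_ms` are omitted for brevity). The function name, the shape of the `observed` dict, and returning 1.0 when no constraints are set are all assumptions for illustration.

```python
def behavior_score(constraints, observed):
    """Fraction of set constraints that pass (1.0 when all pass).

    Hypothetical helper: unset (zero or empty) constraints are skipped,
    so each set constraint contributes equally.
    """
    tools_used = set(observed["tool_calls"])
    checks = []
    if constraints.get("max_tool_calls"):
        checks.append(len(observed["tool_calls"]) <= constraints["max_tool_calls"])
    if constraints.get("max_tokens"):
        checks.append(observed["tokens"] <= constraints["max_tokens"])
    if constraints.get("required_tools"):
        checks.append(set(constraints["required_tools"]) <= tools_used)
    if constraints.get("forbidden_tools"):
        checks.append(not (set(constraints["forbidden_tools"]) & tools_used))
    if not checks:
        return 1.0  # Assumption: nothing to violate means a perfect score.
    return sum(checks) / len(checks)


constraints = {
    "max_tool_calls": 3,
    "max_tokens": 1000,
    "required_tools": ["grep", "edit"],
    "forbidden_tools": ["rm"],
}
observed = {"tool_calls": ["grep", "edit", "rm"], "tokens": 800}
print(behavior_score(constraints, observed))  # 0.75 (forbidden tool used)
```

Here four constraints are set and one fails, so the score drops from 1.0 to 0.75.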
Hooks

Lifecycle hooks run shell commands at specific points during an evaluation. Use them for setup, teardown, or validation.
```yaml
hooks:
  before_run:
    - command: "npm install"
      working_directory: "./fixtures"
      error_on_fail: true
  after_run:
    - command: "bash cleanup.sh"
  before_task:
    - command: "echo Starting task"
  after_task:
    - command: "bash collect-metrics.sh"
```

| Hook | When it runs |
|---|---|
| `before_run` | Once, before the entire evaluation starts |
| `after_run` | Once, after all tasks complete |
| `before_task` | Before each individual task |
| `after_task` | After each individual task |

Each hook entry:

| Field | Type | Default | Description |
|---|---|---|---|
| `command` | string | (required) | Shell command to execute |
| `working_directory` | string | `.` | Working directory for the command |
| `exit_codes` | list[int] | `[0]` | Acceptable exit codes |
| `error_on_fail` | bool | false | Abort the run if this hook fails |
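The interaction between `exit_codes` and `error_on_fail` can be sketched with Python's standard `subprocess` module. This is an illustration of the documented fields and defaults only; `run_hook` is a hypothetical helper, the hook entry is assumed to be already parsed into a dict, and the exception type is chosen for the example.

```python
import subprocess


def run_hook(hook):
    """Run one hook entry and report whether its exit code is acceptable.

    Hypothetical helper using the defaults from the table above:
    working_directory ".", exit_codes [0], error_on_fail False.
    """
    completed = subprocess.run(
        hook["command"],
        shell=True,  # hooks are shell commands
        cwd=hook.get("working_directory", "."),
    )
    ok = completed.returncode in hook.get("exit_codes", [0])
    if not ok and hook.get("error_on_fail", False):
        raise RuntimeError(
            f"hook failed: {hook['command']!r} exited with {completed.returncode}"
        )
    return ok


print(run_hook({"command": "exit 0"}))                        # True
print(run_hook({"command": "exit 3", "exit_codes": [0, 3]}))  # True
print(run_hook({"command": "exit 1"}))                        # False, run continues
```

With `error_on_fail: true`, the last case would raise instead of returning `False`, which corresponds to aborting the run.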
Template Variables
Use the `inputs` field to define global template variables that are substituted into task prompts:

```yaml
inputs:
  language: python
  framework: fastapi

tasks:
  - "tasks/scaffold/*.yaml"
```

Prompt templating also supports fixture file injection:

```yaml
inputs:
  prompt: |
    Explain this code:
    {{fixture:sample.py}}
```

The `{{fixture:filename}}` syntax inlines the content of a file from the fixtures directory into the prompt.
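Fixture injection can be pictured as a find-and-replace over the prompt text. The following sketch shows one way to expand `{{fixture:filename}}` placeholders; `render_fixtures` and the regular expression are hypothetical, and details such as whitespace handling or missing-file behavior in Waza are not covered here.

```python
import re
import tempfile
from pathlib import Path

# Matches {{fixture:<filename>}} and captures the filename.
FIXTURE_PATTERN = re.compile(r"\{\{fixture:([^}]+)\}\}")


def render_fixtures(prompt, fixtures_dir):
    """Replace each {{fixture:filename}} with that file's content.

    Hypothetical helper illustrating the inlining behavior.
    """
    def load(match):
        fixture_path = Path(fixtures_dir) / match.group(1).strip()
        return fixture_path.read_text(encoding="utf-8")

    return FIXTURE_PATTERN.sub(load, prompt)


# Demo with a throwaway fixtures directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sample.py").write_text("def hello():\n    pass\n", encoding="utf-8")
    rendered = render_fixtures("Explain this code:\n{{fixture:sample.py}}", tmp)

print(rendered)
```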
External Task Lists
Use `tasks_from` to load task definitions from a separate YAML file:

```yaml
name: shared-eval
tasks_from: shared-tasks.yaml

config:
  trials_per_task: 3
  model: claude-sonnet-4.6
```

This is useful when multiple eval specs share the same task set but differ in config or graders.
Best Practices
- Clear task descriptions: Future reviewers should understand what’s being tested
- Realistic validators: Don’t over-specify. A few key checks beat 20 strict rules
- Fixture diversity: Include basic, edge case, and negative test fixtures
- Tag your tasks: Makes filtering and analysis easier
- Use timeouts appropriately: Too short = false failures, too long = slow tests
- Reuse graders: Define once, apply across multiple tasks
- Version your evals: Track improvements with version numbers
Next Steps
- Validators & Graders: Reference for all grader types
- Web Dashboard: Explore results interactively
- CLI Reference: All commands and flags