# Writing Eval Specs
A complete reference for writing eval.yaml specifications and task definitions.
## eval.yaml Structure

The evaluation spec defines the benchmark configuration, graders, and task files:
```yaml
name: code-explainer-eval
description: Evaluation suite for code-explainer skill
skill: code-explainer
version: "1.0"

config:
  trials_per_task: 1
  timeout_seconds: 300
  parallel: false
  model: claude-sonnet-4.6

graders:
  - type: text
    name: explains_concepts
    config:
      pattern: "(?i)(function|logic|parameter)"
  - type: code
    name: has_output
    config:
      assertions:
        - "len(output) > 100"

tasks:
  - "tasks/*.yaml"
```

## Top-Level Fields
| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | ✓ | Eval suite name |
| `description` | string | ✓ | What the eval tests |
| `skill` | string | ✓ | Associated skill name |
| `version` | string | ✗ | Version number (e.g., `"1.0"`) |
| `inputs` | object | ✗ | Key-value map of global template variables (see Template Variables) |
| `tasks_from` | string | ✗ | Path to an external YAML file containing the task list |
| `hooks` | object | ✗ | Lifecycle hooks that run shell commands at specific points (see Hooks) |
| `baseline` | bool | ✗ | Mark this spec as a baseline for A/B comparison |
## Config Section

The `config` block controls execution behavior:
```yaml
config:
  trials_per_task: 1          # Run each task this many times
  timeout_seconds: 300        # Task timeout in seconds
  parallel: false             # Run tasks sequentially (true = concurrent)
  workers: 4                  # Parallel workers if parallel: true
  model: claude-sonnet-4.6    # Default model (override with --model)
  judge_model: gpt-4o         # Model for LLM-as-judge graders (optional)
  executor: mock              # mock (local) or copilot-sdk (real API)
```

| Field | Type | Default | Description |
|---|---|---|---|
| `trials_per_task` | int | `1` | Number of times each task runs (for statistical analysis) |
| `timeout_seconds` | int | `300` | Task timeout in seconds |
| `parallel` | bool | `false` | Run tasks concurrently |
| `workers` | int | `4` | Number of parallel workers |
| `model` | string | required | Default model for tasks (override with `--model` flag) |
| `judge_model` | string | (same as `model`) | Model for prompt-type graders (LLM-as-judge) |
| `executor` | string | `copilot-sdk` | Executor: `mock` (local, fast) or `copilot-sdk` (real API) |
| `max_attempts` | int | `0` | Maximum retry attempts per task on failure (0 = no retries) |
| `group_by` | string | — | Group results by a field (e.g., `tags`, `task_id`) |
| `fail_fast` | bool | `false` | Stop the entire run on first task failure |
| `skill_directories` | list[str] | `[]` | Additional directories to search for skills |
| `required_skills` | list[str] | `[]` | Skills that must be available before running |
| `mcp_servers` | object | — | MCP server configurations for the evaluation |
Common timeouts:

- `60` — Quick tasks (single-file review, validation)
- `300` — Standard tasks (code explanation, analysis)
- `600` — Complex tasks (multi-file refactoring, design)
## Graders Section

Graders validate task outputs. Define them once, then reuse them across tasks:
```yaml
graders:
  - type: text
    name: checks_logic
    weight: 2.0
    config:
      pattern: "(?i)(function|variable|parameter)"

  - type: code
    name: has_minimum_output
    config:
      assertions:
        - "len(output) > 100"
        - "'success' in output.lower()"

  - type: text
    name: mentions_key_concepts
    config:
      keywords: ["algorithm", "optimization"]
      must_include_all: true
```

Each grader accepts an optional `weight` (default `1.0`) that controls its influence on the composite score. See Validators & Graders for details.
All graders return:

- `score`: 0.0 to 1.0
- `passed`: boolean
- `message`: human-readable result
See the Validators & Graders guide for all 12 types and examples.
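To make the weighting concrete, here is a minimal sketch of how grader scores could be combined into a composite; the `GraderResult` class and averaging scheme are illustrative assumptions, not waza internals.

```python
# Sketch: combining per-grader scores into a weighted composite.
# GraderResult and the weighted-average scheme are illustrative, not waza API.
from dataclasses import dataclass

@dataclass
class GraderResult:
    name: str
    score: float        # 0.0 to 1.0, as returned by every grader
    weight: float = 1.0  # optional weight from the spec; defaults to 1.0

def composite_score(results: list[GraderResult]) -> float:
    """Weighted average of grader scores."""
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        return 0.0
    return sum(r.score * r.weight for r in results) / total_weight

results = [
    GraderResult("checks_logic", score=1.0, weight=2.0),
    GraderResult("has_minimum_output", score=0.5),
]
print(round(composite_score(results), 3))  # (1.0*2.0 + 0.5*1.0) / 3.0 ≈ 0.833
```

Doubling a grader's weight thus doubles its pull on the final score relative to a default-weight grader.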
## Tasks Section

Tasks define individual test cases, either inline or loaded from files.
### Inline Tasks

```yaml
tasks:
  - id: basic-001
    name: Basic Usage
    description: Test basic functionality
    inputs:
      prompt: "Explain this code"
      files:
        - path: sample.py
    expected:
      output_contains:
        - "function"
        - "variable"
      behavior:
        max_tool_calls: 5
```
### From Files

Load tasks from YAML files in a directory:
```yaml
tasks:
  - "tasks/*.yaml"         # All YAML files in tasks/
  - "tasks/basic/*.yaml"   # Specific subdirectory
  - "tasks/advanced.yaml"  # Single file
```
### Task File Format

Individual task files (e.g., `tasks/basic-usage.yaml`):
```yaml
id: basic-usage-001
name: Basic Usage - Python Function
description: Test that the skill explains a simple Python function correctly.

tags:
  - basic
  - happy-path

inputs:
  prompt: "Explain this function"
  files:
    - path: sample.py

expected:
  output_contains:
    - "function"
    - "parameter"
    - "return"
  outcomes:
    - type: task_completed
  behavior:
    max_tool_calls: 5
    max_response_time_ms: 30000
```
### Task Fields

| Field | Type | Description |
|---|---|---|
| `id` | string | Unique task identifier |
| `name` | string | Human-readable task name |
| `description` | string | What the task tests |
| `tags` | array | Tags for filtering (e.g., `["basic", "edge-case"]`) |
| `inputs` | object | Test inputs (prompt, files) |
| `expected` | object | Validation rules and expected behavior |
### Inputs Section

```yaml
inputs:
  prompt: "Your instruction to the agent"
  files:
    - path: sample.py   # Fixture file (relative to fixtures dir)
      content: |        # Or inline content
        def hello():
            print("Hello")
```

Prompt supports templating:
```yaml
inputs:
  prompt: |
    Explain this code:
    {{fixture:sample.py}}
```
### Expected Section

```yaml
expected:
  # Strings that must appear in output
  output_contains:
    - "function"
    - "parameter"

  # Output must NOT contain these
  output_excludes:
    - "error"
    - "failed"

  # Regex patterns to match
  matches:
    - "returns\\s+.*value"
    - "def\\s+\\w+\\("

  # Task outcomes
  outcomes:
    - type: task_completed
    - type: tool_called
      tool_name: code_analyzer

  # Behavioral constraints
  behavior:
    max_tool_calls: 5
    max_response_time_ms: 30000
    max_tokens: 4096
```
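A sketch of how the three text checks might be evaluated against an agent's output; `check_expected` is a hypothetical helper, not a waza function.

```python
# Sketch: evaluating output_contains / output_excludes / matches.
# The function name and dict shape are illustrative, not waza internals.
import re

def check_expected(output: str, expected: dict) -> bool:
    if not all(s in output for s in expected.get("output_contains", [])):
        return False
    if any(s in output for s in expected.get("output_excludes", [])):
        return False
    # matches entries are regular expressions searched anywhere in the output
    return all(re.search(p, output) for p in expected.get("matches", []))

expected = {
    "output_contains": ["function", "parameter"],
    "output_excludes": ["error"],
    "matches": [r"returns\s+.*value", r"def\s+\w+\("],
}
output = "def add(a, b): this function takes each parameter and returns the summed value."
print(check_expected(output, expected))  # True
```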
## Fixture Isolation

Fixtures are test files (code, documents, data) that tasks reference.

**Important:** Each task gets a fresh temp workspace with fixtures copied in. Original fixtures are never modified.
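The isolation model can be sketched as copying the fixtures directory into a throwaway workspace before each task; the helper below is illustrative, not waza's implementation.

```python
# Sketch: per-task workspace isolation. A task mutates only the copy;
# the original fixtures directory is never touched.
import shutil
import tempfile
from pathlib import Path

def make_workspace(fixtures_dir: Path) -> Path:
    """Copy fixtures into a fresh temp workspace and return its path."""
    workspace = Path(tempfile.mkdtemp(prefix="waza-task-"))
    shutil.copytree(fixtures_dir, workspace / "fixtures")
    return workspace

# Set up a toy fixtures directory
fixtures = Path(tempfile.mkdtemp()) / "fixtures"
fixtures.mkdir()
(fixtures / "sample.py").write_text("def hello():\n    print('Hello')\n")

ws = make_workspace(fixtures)
(ws / "fixtures" / "sample.py").write_text("mutated by the task")

# The original fixture is untouched:
print((fixtures / "sample.py").read_text().startswith("def hello"))  # True
```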
### Using Fixtures

Create a `fixtures/` directory:
```
evals/code-explainer/
├── eval.yaml
├── tasks/
│   └── basic-usage.yaml
└── fixtures/
    ├── sample.py
    ├── complex.py
    └── README.md
```

Reference in tasks:
```yaml
inputs:
  prompt: "Analyze {{fixture:sample.py}}"
  files:
    - path: sample.py
```
## Directory Structure

```
# Project mode
evals/
└── code-explainer/
    ├── eval.yaml
    ├── tasks/
    │   ├── basic-usage.yaml
    │   ├── edge-case.yaml
    │   └── should-not-trigger.yaml
    └── fixtures/
        ├── sample.py
        ├── complex.py
        └── nested/
            └── module.py
```

Specify the context directory when running:
```sh
waza run eval.yaml --context-dir evals/code-explainer/fixtures
```

Or use relative paths in eval.yaml if fixtures are adjacent.
## Multi-Model Comparison

Run the same eval against multiple models:
```sh
# Run with gpt-4o
waza run eval.yaml --model gpt-4o -o gpt4.json

# Run with Claude
waza run eval.yaml --model claude-sonnet-4.6 -o sonnet.json

# Compare results
waza compare gpt4.json sonnet.json
```

Override the default model in eval.yaml:
```sh
waza run eval.yaml --model gpt-4o   # Overrides config.model
```
## Filtering and Parallel Execution

### Filter by Task Name

```sh
waza run eval.yaml --task "basic*" --task "edge*"
```
### Filter by Tags

```sh
waza run eval.yaml --tags "happy-path"
```
### Parallel Execution

```sh
# Run tasks concurrently with 4 workers
waza run eval.yaml --parallel --workers 4
```
## Saving Results

Save eval results for later analysis or comparison:
```sh
waza run eval.yaml -o results.json
```

Output format:
```json
{
  "name": "code-explainer-eval",
  "model": "claude-sonnet-4.6",
  "pass_rate": 0.8,
  "tasks": [
    {
      "id": "basic-001",
      "name": "Basic Usage",
      "passed": true,
      "graders": [
        { "name": "checks_logic", "passed": true, "score": 1.0 }
      ]
    }
  ]
}
```
## Caching

For iterative testing, cache results:
```sh
waza run eval.yaml --cache --cache-dir .waza-cache
```

Only tasks whose inputs or config have changed are re-run.
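One way such a cache can work is to key each task on a hash of its inputs plus the run config, so a task re-runs only when either changes. The key scheme below is an illustrative sketch, not waza's actual cache format.

```python
# Sketch: a content-addressed cache key from task inputs + config.
# Illustrative only; waza's real cache layout may differ.
import hashlib
import json

def cache_key(task_inputs: dict, config: dict) -> str:
    # sort_keys makes the key stable across dict orderings
    payload = json.dumps({"inputs": task_inputs, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

config = {"model": "claude-sonnet-4.6", "timeout_seconds": 300}
k1 = cache_key({"prompt": "Explain this code"}, config)
k2 = cache_key({"prompt": "Explain this code"}, config)
k3 = cache_key({"prompt": "Refactor this code"}, config)
print(k1 == k2, k1 == k3)  # True False
```

Identical inputs and config produce identical keys, so cached results can be looked up by key before executing a task.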
## Common Patterns

### Simple Validation
```yaml
graders:
  - type: text
    name: format_check
    config:
      pattern: "^[A-Z].*\\.$"   # Sentence starting with a capital, ending with a period
```

```yaml
tasks:
  - id: format-001
    inputs:
      prompt: "Write a single sentence"
    expected:
      matches:
        - "^[A-Z].*\\.$"
```
### Multi-Criteria Scoring

```yaml
graders:
  - type: code
    name: completeness
    config:
      assertions:
        - "len(output) > 500"
        - "'function' in output"
        - "'parameter' in output"
```
```yaml
tasks:
  - id: complete-001
    inputs:
      prompt: "Explain this function"
    expected:
      # All 3 assertions must pass
```
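Those assertion strings read as Python expressions evaluated against the output. A minimal sketch of that evaluation (using `eval` purely for illustration; a real grader should sandbox spec-authored expressions):

```python
# Sketch: evaluating a code grader's assertion strings against the output.
# eval() on spec-authored strings is shown for illustration only.
def run_assertions(output: str, assertions: list[str]) -> bool:
    # Expose the output under the name the assertions expect
    return all(eval(expr, {"output": output}) for expr in assertions)

assertions = [
    "len(output) > 500",
    "'function' in output",
    "'parameter' in output",
]
long_explanation = "This function takes one parameter. " * 20
print(run_assertions(long_explanation, assertions))  # True
```

All three assertions must hold for the grader to pass, matching the task above.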
### Behavioral Constraints

```yaml
tasks:
  - id: efficient-001
    inputs:
      prompt: "Refactor this code"
    expected:
      behavior:
        max_tool_calls: 3           # Efficient
        max_response_time_ms: 5000  # Quick
        max_tokens: 1000            # Concise
```

## Hooks

Lifecycle hooks run shell commands at specific points during an evaluation. Use them for setup, teardown, or validation.
```yaml
hooks:
  before_run:
    - command: "npm install"
      working_directory: "./fixtures"
      error_on_fail: true
  after_run:
    - command: "bash cleanup.sh"
  before_task:
    - command: "echo Starting task"
  after_task:
    - command: "bash collect-metrics.sh"
```

| Hook | When it runs |
|---|---|
| `before_run` | Once, before the entire evaluation starts |
| `after_run` | Once, after all tasks complete |
| `before_task` | Before each individual task |
| `after_task` | After each individual task |
Each hook entry:
| Field | Type | Default | Description |
|---|---|---|---|
| `command` | string | (required) | Shell command to execute |
| `working_directory` | string | `.` | Working directory for the command |
| `exit_codes` | list[int] | `[0]` | Acceptable exit codes |
| `error_on_fail` | bool | `false` | Abort the run if this hook fails |
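The hook fields above can be sketched as a small runner that executes the command, checks the exit code against `exit_codes`, and aborts when `error_on_fail` is set; `run_hook` is a hypothetical helper, not waza's implementation.

```python
# Sketch: executing one hook entry with the defaults from the table above.
# run_hook is illustrative, not waza internals.
import subprocess
import sys

def run_hook(hook: dict) -> bool:
    result = subprocess.run(
        hook["command"],
        shell=True,
        cwd=hook.get("working_directory", "."),
    )
    ok = result.returncode in hook.get("exit_codes", [0])
    if not ok and hook.get("error_on_fail", False):
        raise RuntimeError(f"hook failed: {hook['command']}")
    return ok

# A trivially succeeding hook (runs the current Python interpreter)
print(run_hook({"command": f'"{sys.executable}" -c "pass"'}))  # True
```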
## Template Variables

Use the `inputs` field to define global template variables that are substituted into task prompts:
```yaml
inputs:
  language: python
  framework: fastapi

tasks:
  - id: scaffold-001
    inputs:
      prompt: "Create a {{language}} app using {{framework}}"
```

Prompt templating also supports fixture file injection:
```yaml
inputs:
  prompt: |
    Explain this code:
    {{fixture:sample.py}}
```

The `{{fixture:filename}}` syntax inlines the content of a file from the fixtures directory into the prompt.
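Both substitution forms can be sketched with a single regex pass; `render_prompt` is an illustrative helper, not waza's templating engine.

```python
# Sketch: rendering {{var}} from the global inputs map and {{fixture:name}}
# from the fixtures directory. Illustrative only.
import re
import tempfile
from pathlib import Path

def render_prompt(template: str, variables: dict, fixtures_dir: Path) -> str:
    def sub(match: re.Match) -> str:
        token = match.group(1)
        if token.startswith("fixture:"):
            # Inline the fixture file's content
            return (fixtures_dir / token.removeprefix("fixture:")).read_text()
        return str(variables[token])
    return re.sub(r"\{\{([^}]+)\}\}", sub, template)

fixtures = Path(tempfile.mkdtemp())
(fixtures / "sample.py").write_text("def hello(): pass")

prompt = render_prompt(
    "Create a {{language}} app. Explain: {{fixture:sample.py}}",
    {"language": "python"},
    fixtures,
)
print(prompt)  # Create a python app. Explain: def hello(): pass
```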
## External Task Lists

Use `tasks_from` to load task definitions from a separate YAML file:
```yaml
name: shared-eval
tasks_from: shared-tasks.yaml

config:
  trials_per_task: 3
  model: claude-sonnet-4.6
```

This is useful when multiple eval specs share the same task set but differ in config or graders.
## Best Practices

- Clear task descriptions — Future reviewers should understand what’s being tested
- Realistic validators — Don’t over-specify. A few key checks beat 20 strict rules
- Fixture diversity — Include basic, edge case, and negative test fixtures
- Tag your tasks — Makes filtering and analysis easier
- Use timeout appropriately — Too short = false failures, too long = slow tests
- Reuse graders — Define once, apply across multiple tasks
- Version your evals — Track improvements with version numbers
## Next Steps

- Validators & Graders — Reference for all grader types
- Web Dashboard — Explore results interactively
- CLI Reference — All commands and flags