Writing Eval Specs

A complete reference for writing eval.yaml specifications and task definitions.

The evaluation spec defines the benchmark configuration, graders, and task files:

name: code-explainer-eval
description: Evaluation suite for code-explainer skill
skill: code-explainer
version: "1.0"
config:
  trials_per_task: 1
  timeout_seconds: 300
  parallel: false
  executor: mock
  model: claude-sonnet-4.6
metrics:
  - name: accuracy
    weight: 1.0
    threshold: 0.8
graders:
  - type: text
    name: explains_concepts
    config:
      regex_match:
        - "(?i)(function|logic|parameter)"
  - type: code
    name: has_output
    config:
      assertions:
        - "len(output) > 100"
tasks:
  - "tasks/*.yaml"

| Field | Type | Description |
|---|---|---|
| name | string | Eval suite name |
| description | string | What the eval tests |
| skill | string | Associated skill or custom agent name (SKILL.md or .agent.md) |
| version | string | Version number (e.g., "1.0") |
| inputs | object | Key-value map of global template variables (see Template Variables) |
| tasks_from | string | Path to an external YAML file containing the task list |
| hooks | object | Lifecycle hooks that run shell commands at specific points (see Hooks) |
| baseline | bool | Mark this spec as a baseline for A/B comparison |

skill is required.

Waza supports evaluating VS Code custom agents (.agent.md files) alongside traditional SKILL.md-based skills. When you target an agent with a tools: field in its frontmatter, Waza automatically injects a tool_constraint grader to validate that only the declared tools are called.

Specify an agent:

name: security-agent-eval
description: Evaluate the security-reviewer custom agent
skill: security-reviewer  # Points to security-reviewer.agent.md
version: "1.0"
config:
  model: claude-sonnet-4.6

Key differences:

  • Use skill: <name> to target either a skill or a custom agent
  • Waza discovers .agent.md files the same way as SKILL.md — in the current directory or agents/ subdirectories
  • If both SKILL.md and .agent.md exist in the same directory, SKILL.md takes priority
  • Custom agents can declare a tools: field in frontmatter, which auto-injects a tool_constraint grader

Learn more: See the Evaluating Custom Agents guide for detailed examples and the auto-injected tool constraint behavior.
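
To make the idea behind the tool constraint concrete, here is a minimal Python sketch that compares the tools an agent called with the tools declared in its frontmatter. The function name undeclared_tools and the tool names are hypothetical, chosen only for illustration; Waza's actual tool_constraint grader may work differently.

```python
def undeclared_tools(declared: list[str], called: list[str]) -> list[str]:
    """Return tools that were called but not declared, in first-call order."""
    allowed = set(declared)
    seen: set[str] = set()
    violations: list[str] = []
    for tool in called:
        if tool not in allowed and tool not in seen:
            seen.add(tool)
            violations.append(tool)
    return violations


# Hypothetical frontmatter `tools:` list and a hypothetical call trace.
declared = ["read_file", "grep"]
called = ["grep", "read_file", "shell", "grep", "shell"]
print(undeclared_tools(declared, called))  # ['shell']
```

An empty result means every call stayed within the declared tool list.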

The config block controls execution behavior:

config:
  trials_per_task: 1        # Run each task this many times
  timeout_seconds: 300      # Task timeout in seconds
  parallel: false           # Run tasks sequentially (true = concurrent)
  workers: 4                # Parallel workers if parallel: true
  model: claude-sonnet-4.6  # Default model (override with --model)
  judge_model: gpt-4o       # Model for LLM-as-judge graders (optional)
  executor: mock            # mock (local) or copilot-sdk (real API)
  instruction_files:
    - .github/instructions/project.instructions.md

| Field | Type | Default | Description |
|---|---|---|---|
| trials_per_task | int | 1 | Number of times each task runs (for statistical analysis) |
| timeout_seconds | int | 300 | Task timeout in seconds |
| parallel | bool | false | Run tasks concurrently |
| workers | int | 4 | Number of parallel workers |
| model | string | required | Default model for tasks (override with --model flag) |
| judge_model | string | (same as model) | Model for prompt-type graders (LLM-as-judge) |
| executor | string | copilot-sdk | Executor: mock (local, echoes task metadata and file content) or copilot-sdk (real API) |
| max_attempts | int | 0 | Maximum retry attempts per task on failure (0 = no retries) |
| group_by | string | | Group results by a field (e.g., tags, task_id) |
| fail_fast | bool | false | Stop the entire run on first task failure |
| skill_directories | list[str] | [] | Additional directories to search for skills |
| instruction_files | list[str] | [] | Instruction files to apply to every task |
| disabled_skills | list[str] | [] | Skills to disable. Use ["*"] to disable all skills |
| required_skills | list[str] | [] | Skills that must be available before running |
| mcp_servers | object | | MCP server configurations for the evaluation |

Common Timeouts:

  • 60 — Quick tasks (single-file review, validation)
  • 300 — Standard tasks (code explanation, analysis)
  • 600 — Complex tasks (multi-file refactoring, design)

Graders validate task outputs. Define once, reuse across tasks:

graders:
  - type: text
    name: checks_logic
    weight: 2.0
    config:
      regex_match:
        - "(?i)(function|variable|parameter)"
  - type: code
    name: has_minimum_output
    config:
      assertions:
        - "len(output) > 100"
        - "'success' in output.lower()"
  - type: text
    name: mentions_key_concepts
    config:
      contains:
        - "algorithm"
        - "optimization"

Each grader accepts an optional weight (default 1.0) that controls its influence on the composite score. See Validators & Graders for details.

All graders return:

  • score: 0.0 to 1.0
  • passed: boolean
  • message: human-readable result

See the Validators & Graders guide for all 12 types and examples.
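
As a rough illustration of how weights could influence a composite score, the sketch below models a grader result with the three returned fields and combines scores as a weighted mean. The weighted-mean formula and the names GraderResult and composite_score are assumptions made for this example; the source only states that weight controls a grader's influence.

```python
from dataclasses import dataclass


@dataclass
class GraderResult:
    name: str
    score: float         # 0.0 to 1.0
    passed: bool
    message: str
    weight: float = 1.0  # optional, defaults to 1.0


def composite_score(results: list[GraderResult]) -> float:
    """Weighted mean of grader scores (assumed formula, for illustration)."""
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        return 0.0
    return sum(r.score * r.weight for r in results) / total_weight


results = [
    GraderResult("checks_logic", 1.0, True, "matched", weight=2.0),
    GraderResult("has_minimum_output", 0.0, False, "output too short"),
]
print(round(composite_score(results), 3))  # 0.667
```

Under this assumption, the grader with weight 2.0 counts twice as much as the default-weight grader.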

OpenAI Evals modelgraded specs usually collapse into Waza’s prompt grader. The judge prompt carries the label semantics, while Waza handles execution and scoring.

| OpenAI Evals field | Waza equivalent | Notes |
|---|---|---|
| prompt | graders[].config.prompt | Put the judging instructions directly in the prompt |
| choice_strings | prompt text | List the labels in the judge prompt; Waza's prompt grader is binary, so the label choice becomes pass/fail guidance |
| choice_scores | prompt text | Encode the scoring rule in the judge prompt; use pairwise mode when the comparison is relative |
| input_outputs | tasks: entries | Turn each example into one Waza task with its own inputs.prompt and expected checks |
| eval_type: cot_classify | type: prompt | Use mode: independent for one-shot classification |
| battle.yaml | type: prompt + mode: pairwise | Closest grader match for head-to-head comparison; waza compare is still the better run-level report |

OpenAI’s registry uses this pattern for fixed-choice factual classification. In Waza, keep the evaluation as a single prompt grader and turn each input/output row into a task:

graders:
  - type: prompt
    name: fact_check
    config:
      prompt: |
        You are checking a multiple-choice answer.
        Valid choices: A, B, C, D, E.
        Call set_waza_grade_pass only if the model's answer matches the correct choice.
        Otherwise call set_waza_grade_fail with a short reason.
      continue_session: false
tasks:
  - id: fact-001
    name: fact-001
    inputs:
      prompt: "Which answer is correct for the fact pattern?"
    expected:
      output_contains:
        - "B"

For closed-book QA, the judge prompt can encode the score mapping directly:

graders:
  - type: prompt
    name: closedqa_judge
    config:
      prompt: |
        Judge the answer against the reference.
        If the answer is fully correct, call set_waza_grade_pass.
        If it is partially correct or incorrect, call set_waza_grade_fail.
        Treat "Y" as 1.0 and "N" as 0.0 in your reasoning, but only emit pass/fail.
      model: claude-sonnet-4.5
tasks:
  - id: closedqa-001
    name: closedqa-001
    inputs:
      prompt: "Answer the question using the provided context."
    expected:
      output_contains:
        - "Y"

Battle-style comparisons are the one place where the mapping is not 1:1. The nearest Waza translation is a pairwise prompt grader, but the run-level comparison report is usually better expressed with waza compare:

config:
  baseline: true
graders:
  - type: prompt
    name: battle_judge
    config:
      mode: pairwise
      prompt: |
        Compare the two answers and decide which one is better.
        Call set_waza_grade_pass if the skill run wins.
        Call set_waza_grade_fail if the baseline run wins.
tasks:
  - id: battle-001
    name: battle-001
    inputs:
      prompt: "Compare these two solutions and pick the better one."

Tasks define individual test cases loaded from YAML files.

Load tasks from YAML files in a directory:

tasks:
  - "tasks/*.yaml"         # All YAML files in tasks/
  - "tasks/basic/*.yaml"   # Specific subdirectory
  - "tasks/advanced.yaml"  # Single file
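
As a rough illustration of how such patterns expand, the Python sketch below resolves each glob relative to a base directory. The helper expand_task_patterns is hypothetical, and the de-duplication and sort order it applies are assumptions; Waza's own loader may behave differently.

```python
import glob
import os
import tempfile


def expand_task_patterns(base_dir: str, patterns: list[str]) -> list[str]:
    """Expand glob patterns relative to base_dir, de-duplicated, in order."""
    found: list[str] = []
    for pattern in patterns:
        for path in sorted(glob.glob(os.path.join(base_dir, pattern))):
            rel = os.path.relpath(path, base_dir).replace(os.sep, "/")
            if rel not in found:
                found.append(rel)
    return found


# Build a throwaway directory tree to demonstrate the expansion.
with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "tasks", "basic"))
    for name in ("tasks/a.yaml", "tasks/advanced.yaml", "tasks/basic/b.yaml"):
        open(os.path.join(base, name), "w").close()
    demo = expand_task_patterns(
        base, ["tasks/*.yaml", "tasks/basic/*.yaml", "tasks/advanced.yaml"]
    )
    print(demo)
```

Note that "tasks/*.yaml" does not descend into subdirectories, which is why the subdirectory needs its own pattern.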

Individual task files (e.g., tasks/basic-usage.yaml):

id: basic-usage-001
name: Basic Usage - Python Function
description: Test that the skill explains a simple Python function correctly.
tags:
  - basic
  - happy-path
inputs:
  prompt: "Explain this function"
  files:
    - path: sample.py
expected:
  output_contains:
    - "function"
    - "parameter"
    - "return"
  outcomes:
    - type: task_completed
  behavior:
    max_tool_calls: 5

| Field | Type | Description |
|---|---|---|
| id | string | Unique task identifier |
| name | string | Human-readable task name |
| description | string | What the task tests |
| tags | array | Tags for filtering (e.g., ["basic", "edge-case"]) |
| inputs | object | Test inputs (prompt, files) |
| expected | object | Validation rules and expected behavior |
| skill_directories | string[] | Skill directories for this task (overrides eval-level) |
| instruction_files | string[] | Instruction files for this task (adds to eval-level files) |

inputs:
  prompt: "Your instruction to the agent"
  files:
    - path: sample.py    # Fixture file (relative to fixtures dir)
      content: |         # Or inline content
        def hello():
            print("Hello")

Use prompt_file instead of prompt to load the prompt text from an external file. The path is resolved relative to the task YAML file’s directory.

# tasks/complex-review.yaml
inputs:
  prompt_file: prompts/review-instructions.md
  files:
    - path: sample.py

This is useful when prompts are long, shared across tasks, or maintained separately. You must specify either prompt or prompt_file, but not both.
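
The either-or rule and the path resolution can be sketched in Python as follows. This is a minimal illustration, not Waza's implementation; the function resolve_prompt and the error message are hypothetical.

```python
from pathlib import Path
import tempfile


def resolve_prompt(inputs: dict, task_file: Path) -> str:
    """Return the prompt text from `prompt` or `prompt_file` (exactly one)."""
    has_prompt = "prompt" in inputs
    has_file = "prompt_file" in inputs
    if has_prompt == has_file:
        raise ValueError("specify exactly one of prompt or prompt_file")
    if has_prompt:
        return inputs["prompt"]
    # prompt_file is resolved relative to the task YAML file's directory.
    return (task_file.parent / inputs["prompt_file"]).read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    task_file = Path(tmp) / "tasks" / "complex-review.yaml"
    prompt_path = task_file.parent / "prompts" / "review-instructions.md"
    prompt_path.parent.mkdir(parents=True)
    prompt_path.write_text("Review this file.", encoding="utf-8")
    loaded = resolve_prompt(
        {"prompt_file": "prompts/review-instructions.md"}, task_file
    )
    print(loaded)  # Review this file.
```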

Use follow_up_prompts to send additional messages after the initial prompt. Each follow-up reuses the same session and workspace, so file changes and conversation history persist across turns.

inputs:
  prompt: "Create a Python function that reads a CSV file"
  follow_up_prompts:
    - "Add error handling for missing files"
    - "Write unit tests for the function"

This is useful for evaluating multi-turn conversations where each step builds on the previous one. Graders run only after all prompts (initial + follow-ups) have completed, so the final output reflects the full conversation.

Prompt supports templating:

inputs:
  prompt: |
    Explain this code:
    {{fixture:sample.py}}

expected:
  # Strings that must appear in output
  output_contains:
    - "function"
    - "parameter"
  # Output must NOT contain these
  output_not_contains:
    - "error"
    - "failed"
  # At least one of these must appear (flexible matching)
  output_contains_any:
    - "recursion"
    - "iteration"
    - "loop"
  # Task outcomes
  outcomes:
    - type: task_completed
    - type: tool_called
      tool_name: code_analyzer
  # Behavioral constraints
  behavior:
    max_tool_calls: 5
    max_tokens: 4096

  • output_contains — ALL listed strings must appear (AND logic). Use for required content.
  • output_contains_any — At least ONE listed string must appear (OR logic). Use when the agent may express concepts in different ways.

All checks are case-insensitive.
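
The AND/OR semantics above can be sketched in a few lines of Python. The function check_expected is a hypothetical helper written for illustration, assuming plain substring matching; it is not Waza's implementation.

```python
def check_expected(output: str, expected: dict) -> bool:
    """Apply output_contains (AND), output_not_contains, output_contains_any (OR)."""
    text = output.lower()  # all checks are case-insensitive
    must = expected.get("output_contains", [])
    must_not = expected.get("output_not_contains", [])
    any_of = expected.get("output_contains_any", [])
    return (
        all(s.lower() in text for s in must)
        and not any(s.lower() in text for s in must_not)
        and (not any_of or any(s.lower() in text for s in any_of))
    )


expected = {
    "output_contains": ["function", "parameter"],
    "output_not_contains": ["error", "failed"],
    "output_contains_any": ["recursion", "iteration", "loop"],
}
print(check_expected("This Function takes one PARAMETER and uses a loop.", expected))  # True
print(check_expected("This function takes no arguments.", expected))  # False
```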

Fixtures are test files (code, documents, data) that tasks reference.

Important: Each task gets a fresh temp workspace with fixtures copied in. Original fixtures are never modified.

Create a fixtures/ directory:

evals/code-explainer/
├── eval.yaml
├── tasks/
│   └── basic-usage.yaml
└── fixtures/
    ├── sample.py
    ├── complex.py
    └── README.md

Reference in tasks:

inputs:
  prompt: "Analyze {{fixture:sample.py}}"
  files:
    - path: sample.py

Use instruction_files for repository or task-specific *.instructions.md guidance:

# eval.yaml
config:
  instruction_files:
    - .github/instructions/project.instructions.md

# tasks/review.yaml
instruction_files:
  - .github/instructions/review.instructions.md
inputs:
  prompt: "Review this change"
  files:
    - path: sample.py

Instruction files are resolved from the active fixtures/context directory, copied into each fresh temp workspace, and appended to the agent system message with path labels. Eval-level files apply to every task; task-level files are added for that task. Paths must be relative and cannot use directory traversal.
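
The path rule (relative, no directory traversal) can be approximated with a small Python check. This is a sketch under the assumption that paths use forward slashes; the helper is_safe_relative_path is hypothetical and Waza's own validation may be stricter.

```python
from pathlib import PurePosixPath


def is_safe_relative_path(path: str) -> bool:
    """True for non-empty relative paths that contain no `..` segments."""
    p = PurePosixPath(path)
    return bool(path) and not p.is_absolute() and ".." not in p.parts


for candidate in (
    ".github/instructions/project.instructions.md",
    "../secrets.instructions.md",
    "/etc/review.instructions.md",
):
    print(candidate, is_safe_relative_path(candidate))
```

Only the first candidate passes; the other two are rejected for traversal and for being absolute.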

# Project mode
evals/
└── code-explainer/
    ├── eval.yaml
    ├── tasks/
    │   ├── basic-usage.yaml
    │   ├── edge-case.yaml
    │   └── should-not-trigger.yaml
    └── fixtures/
        ├── sample.py
        ├── complex.py
        └── nested/
            └── module.py

Specify context directory when running:

waza run eval.yaml --context-dir evals/code-explainer/fixtures

Or use relative paths in eval.yaml if fixtures are adjacent.

Run the same eval against multiple models:

# Run with gpt-4o
waza run eval.yaml --model gpt-4o -o gpt4.json
# Run with Claude
waza run eval.yaml --model claude-sonnet-4.6 -o sonnet.json
# Compare results
waza compare gpt4.json sonnet.json

Override the default model in eval.yaml:

waza run eval.yaml --model gpt-4o    # Overrides config.model

# Run only tasks matching these patterns
waza run eval.yaml --task "basic*" --task "edge*"

# Run only tasks with this tag
waza run eval.yaml --tags "happy-path"

# Run tasks concurrently with 4 workers
waza run eval.yaml --parallel --workers 4

Save eval results for later analysis or comparison:

waza run eval.yaml -o results.json

Output format:

{
  "name": "code-explainer-eval",
  "model": "claude-sonnet-4.6",
  "pass_rate": 0.8,
  "tasks": [
    {
      "id": "basic-001",
      "name": "Basic Usage",
      "passed": true,
      "graders": [
        {
          "name": "checks_logic",
          "passed": true,
          "score": 1.0
        }
      ]
    }
  ]
}
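
A saved results file in this shape is straightforward to post-process. The Python sketch below lists every grader that did not pass; it relies only on the fields shown in the sample above, and the second task (edge-001) is made-up data added so the example has a failure to report.

```python
import json

# Sample in the format shown above; "edge-001" is hypothetical data.
RESULTS_JSON = """
{
  "name": "code-explainer-eval",
  "pass_rate": 0.5,
  "tasks": [
    {"id": "basic-001", "passed": true,
     "graders": [{"name": "checks_logic", "passed": true, "score": 1.0}]},
    {"id": "edge-001", "passed": false,
     "graders": [{"name": "has_minimum_output", "passed": false, "score": 0.0}]}
  ]
}
"""


def failed_graders(results: dict) -> list[tuple[str, str]]:
    """List (task id, grader name) pairs for every grader that did not pass."""
    return [
        (task["id"], grader["name"])
        for task in results.get("tasks", [])
        for grader in task.get("graders", [])
        if not grader["passed"]
    ]


results = json.loads(RESULTS_JSON)
print(failed_graders(results))  # [('edge-001', 'has_minimum_output')]
```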

For iterative testing, cache results:

waza run eval.yaml --cache --cache-dir .waza-cache

Only tasks with changed inputs/config re-run.

graders:
  - type: text
    name: format_check
    config:
      regex_match:
        - "^[A-Z].*\\.$"  # Sentence starting with capital, ending with period
tasks:
  - "tasks/format/*.yaml"

graders:
  - type: code
    name: completeness
    config:
      assertions:
        - "len(output) > 500"
        - "'function' in output"
        - "'parameter' in output"
tasks:
  - "tasks/completeness/*.yaml"

Behavioral constraints are defined in individual task YAML files:

# tasks/efficient-001.yaml
id: efficient-001
name: Efficiency test
inputs:
  prompt: "Refactor this code"
expected:
  behavior:
    max_tool_calls: 3            # Efficient
    max_tokens: 1000             # Concise
    max_response_time_ms: 30000  # Must complete within 30 seconds
    required_tools:              # Must use these tools
      - grep
      - edit
    forbidden_tools:             # Must NOT use these tools
      - rm

| Field | Type | Description |
|---|---|---|
| max_tool_calls | int | Maximum number of tool invocations allowed |
| max_iterations | int | Maximum number of conversation rounds (turns) |
| max_tokens | int | Maximum tokens in the response |
| max_response_time_ms | int | Maximum wall-clock execution time in milliseconds |
| required_tools | string[] | Tools the agent must use during the task |
| forbidden_tools | string[] | Tools the agent must NOT use during the task |
Each constraint that is set (non-zero / non-empty) contributes equally to the behavior efficiency score. If all constraints pass, the score is 1.0; each failure reduces it proportionally.
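
The proportional scoring can be sketched in Python as the fraction of set constraints that pass. This is an illustration under stated assumptions: the observed-run keys (tools, iterations, tokens, response_time_ms) are hypothetical, required_tools and forbidden_tools are each counted as a single constraint, and a run with no constraints set is scored 1.0.

```python
def behavior_score(constraints: dict, observed: dict) -> float:
    """Fraction of set constraints that pass; 1.0 when none are set (assumed)."""
    tools = observed.get("tools", [])
    checks: list[bool] = []
    limits = {
        "max_tool_calls": len(tools),
        "max_iterations": observed.get("iterations", 0),
        "max_tokens": observed.get("tokens", 0),
        "max_response_time_ms": observed.get("response_time_ms", 0),
    }
    for key, actual in limits.items():
        limit = constraints.get(key, 0)
        if limit:  # zero or missing means "not set"
            checks.append(actual <= limit)
    if constraints.get("required_tools"):
        checks.append(set(constraints["required_tools"]) <= set(tools))
    if constraints.get("forbidden_tools"):
        checks.append(not set(constraints["forbidden_tools"]) & set(tools))
    return sum(checks) / len(checks) if checks else 1.0


constraints = {
    "max_tool_calls": 3,
    "max_tokens": 1000,
    "max_response_time_ms": 30000,
    "required_tools": ["grep", "edit"],
    "forbidden_tools": ["rm"],
}
# Hypothetical run: four tool calls, so only max_tool_calls fails.
observed = {
    "tools": ["grep", "edit", "grep", "edit"],
    "tokens": 800,
    "response_time_ms": 12000,
}
print(behavior_score(constraints, observed))  # 0.8
```

Here five constraints are set and one fails, so the score drops from 1.0 to 0.8.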

Lifecycle hooks run shell commands at specific points during an evaluation. Use them for setup, teardown, or validation.

hooks:
  before_run:
    - command: "npm install"
      working_directory: "./fixtures"
      error_on_fail: true
  after_run:
    - command: "bash cleanup.sh"
  before_task:
    - command: "echo Starting task"
  after_task:
    - command: "bash collect-metrics.sh"

| Hook | When it runs |
|---|---|
| before_run | Once, before the entire evaluation starts |
| after_run | Once, after all tasks complete |
| before_task | Before each individual task |
| after_task | After each individual task |

Each hook entry:

| Field | Type | Default | Description |
|---|---|---|---|
| command | string | (required) | Shell command to execute |
| working_directory | string | . | Working directory for the command |
| exit_codes | list[int] | [0] | Acceptable exit codes |
| error_on_fail | bool | false | Abort the run if this hook fails |
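
To show how these four fields could interact, here is a minimal Python sketch of running one hook entry. The function run_hook is hypothetical and only mirrors the defaults in the table; it is not Waza's hook runner.

```python
import subprocess


def run_hook(hook: dict) -> bool:
    """Run one hook entry; return True if its exit code is acceptable."""
    completed = subprocess.run(
        hook["command"],
        shell=True,
        cwd=hook.get("working_directory", "."),
    )
    ok = completed.returncode in hook.get("exit_codes", [0])
    if not ok and hook.get("error_on_fail", False):
        # Assumed behavior: a failing hook with error_on_fail aborts the run.
        raise RuntimeError(f"hook failed: {hook['command']}")
    return ok


print(run_hook({"command": "exit 3", "exit_codes": [0, 3]}))  # True
print(run_hook({"command": "exit 3"}))  # False
```

With the default exit_codes of [0], a command exiting with 3 counts as a failure, but it only stops the run when error_on_fail is true.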

Use the inputs field to define global template variables that are substituted into task prompts:

inputs:
  language: python
  framework: fastapi
tasks:
  - "tasks/scaffold/*.yaml"

Prompt templating also supports fixture file injection:

inputs:
  prompt: |
    Explain this code:
    {{fixture:sample.py}}

The {{fixture:filename}} syntax inlines the content of a file from the fixtures directory into the prompt.
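
The inlining step can be pictured as a simple substitution. The Python sketch below is an illustration only: render_prompt and the regular expression are assumptions made for this example, not Waza's template engine.

```python
import re
import tempfile
from pathlib import Path

# Matches {{fixture:<name>}} and captures the file name.
FIXTURE_PATTERN = re.compile(r"\{\{fixture:([^}]+)\}\}")


def render_prompt(prompt: str, fixtures_dir: Path) -> str:
    """Replace each {{fixture:name}} with the content of fixtures_dir/name."""
    def inline(match: re.Match) -> str:
        return (fixtures_dir / match.group(1).strip()).read_text(encoding="utf-8")
    return FIXTURE_PATTERN.sub(inline, prompt)


with tempfile.TemporaryDirectory() as tmp:
    fixtures = Path(tmp)
    (fixtures / "sample.py").write_text(
        "def hello():\n    print('Hello')\n", encoding="utf-8"
    )
    rendered = render_prompt("Explain this code:\n{{fixture:sample.py}}", fixtures)
    print(rendered)
```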


Use tasks_from to load task definitions from a separate YAML file:

name: shared-eval
tasks_from: shared-tasks.yaml
config:
  trials_per_task: 3
  model: claude-sonnet-4.6

This is useful when multiple eval specs share the same task set but differ in config or graders.


  1. Clear task descriptions — Future reviewers should understand what’s being tested
  2. Realistic validators — Don’t over-specify. A few key checks beat 20 strict rules
  3. Fixture diversity — Include basic, edge case, and negative test fixtures
  4. Tag your tasks — Makes filtering and analysis easier
  5. Use timeout appropriately — Too short = false failures, too long = slow tests
  6. Reuse graders — Define once, apply across multiple tasks
  7. Version your evals — Track improvements with version numbers