YAML Schema
Complete reference for the YAML schema used in Waza evaluations.
eval.yaml
Main evaluation configuration file.

    name: code-explainer-eval       # Required: eval suite name
    description: "..."              # Required: what this eval tests
    skill: code-explainer           # Required: skill name
    version: "1.0"                  # Optional: version number

    config:
      trials_per_task: 1            # Runs per task
      timeout_seconds: 300          # Task timeout
      parallel: false               # Concurrent execution
      workers: 0                    # Auto-size parallel workers
      model: claude-sonnet-4.6      # Default model
      executor: mock                # mock or copilot-sdk

    graders:                        # Validation rules
      - type: text
        name: checks_logic
        config:
          regex_match:
            - "(?i)(function)"

    tasks:                          # Test cases
      - "tasks/*.yaml"              # From files

Top-Level Fields
name
Type: string
Required: yes
The unique name of this evaluation suite. Used in reports and result files.
    name: code-explainer-eval

description
Type: string
Required: yes
Describes what this evaluation tests. Appears in reports.
    description: "Evaluates agent's ability to explain Python code"

skill
Type: string
Required: yes
The skill being evaluated.
    skill: code-explainer

version
Type: string
Required: no
Default: (empty)
Version number of the evaluation suite.
    version: "1.0"

config Section
trials_per_task
Type: integer
Default: 1
How many times each task is run. Use > 1 for statistical confidence.
    config:
      trials_per_task: 3

timeout_seconds
Type: integer
Default: 300
Maximum seconds per task before timeout.
    config:
      timeout_seconds: 300  # 5 minutes

Common values:
- 60: Quick validation tasks
- 300: Standard code analysis
- 600: Complex multi-file tasks

first_event_timeout_seconds
Type: integer
Default: 0 (disabled)
Maximum seconds to wait for the first session event before treating a run as
a session-start hang and aborting it. The embedded engine can occasionally launch
but never start the agent’s first turn, so no events arrive at all. Without this
guard, such a run blocks until timeout_seconds expires. Because timeout_seconds
must be sized for the slowest legitimate full turn, it is a poor catch-all for a
failure that could be detected quickly. Keep this value comfortably above realistic
first-turn latency so that a slow-but-live first turn is never aborted. 0 disables
the check. Override it per task with the task-level field of the same name
(0 disables it for that task).
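As a sketch of the per-task override mentioned above, a task file might disable the guard for one task while the eval-level default stays in force. The file name, id, name, prompt, and expected values are placeholders, and the top-level placement of the field inside the task file is an assumption; this page states only that a task-level field of the same name exists.

```yaml
# tasks/slow-start.yaml (hypothetical file name)
id: slow-start-001
name: Slow Start
inputs:
  prompt: "Summarize the repository layout"
expected:
  output_contains: ["layout"]
# Assumed placement: top level of the task file.
first_event_timeout_seconds: 0   # 0 disables the first-event check for this task only
```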
    config:
      timeout_seconds: 1800              # 30 min: the slowest legitimate full turn
      first_event_timeout_seconds: 300   # but the first turn must start within 5 min

parallel
Type: boolean
Default: false
Run tasks concurrently instead of sequentially.
    config:
      parallel: true

workers
Type: integer
Default: 0
Number of concurrent workers when parallel: true. Use 0 or omit the field to auto-size workers.
    config:
      parallel: true
      workers: 8

model
Type: string
Default: (empty)
Default LLM model. Override with --model flag.
    config:
      model: claude-sonnet-4.6

executor
Type: string
Default: mock
Options: mock, copilot-sdk
Execution engine:
- mock: Local testing (no API calls)
- copilot-sdk: Real LLM execution

    config:
      executor: copilot-sdk

max_attempts
Type: integer
Default: 0
Maximum retry attempts per task on failure. Set to 0 for no retries.
    config:
      max_attempts: 3

group_by
Type: string
Default: (none)
Group results by a field in the output (e.g., tags, task_id). Useful for organizing results when running many tasks.
    config:
      group_by: tags

fail_fast
Type: boolean
Default: false
Stop the entire evaluation run on the first task failure instead of continuing.
    config:
      fail_fast: true

skill_directories
Type: list[string]
Default: []
Additional directories to search for skills beyond the default skills/ directory.
    config:
      skill_directories:
        - ./custom-skills
        - /opt/shared-skills

instruction_files
Type: list[string]
Default: []
Instruction files to apply to every task. Paths are resolved relative to the active fixtures/context directory, copied into each task workspace, and appended to the agent system message as path-labeled instructions.
    config:
      instruction_files:
        - .github/instructions/project.instructions.md

inject_skill_body
Type: boolean
Default: true
Controls whether Waza injects the full target SKILL.md or .agent.md body into the system prompt when skill: is set.

    config:
      inject_skill_body: false

When false, Waza keeps skill discovery enabled and still includes the compact <available_skills> summary with names and descriptions, but it does not add the target <skill_context> block. This is useful for trigger-precision evals that use behavior or skill_invocation graders to measure whether the agent actually invokes the skill tool. disabled_skills: ["*"] takes precedence and disables all skill loading.
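A trigger-precision eval of the kind described here might be configured roughly as follows. This is only a sketch: the suite name, description, and grader name are placeholders, and the skill_invocation grader's type-specific config is left out because it is not documented on this page.

```yaml
name: code-explainer-trigger-eval      # hypothetical suite name
description: "Checks whether the agent invokes the skill on its own"
skill: code-explainer

config:
  inject_skill_body: false             # keep <available_skills>, omit <skill_context>

graders:
  - type: skill_invocation
    name: invokes_skill                # hypothetical grader name
    # config: type-specific settings omitted; see Validators & Graders

tasks:
  - "tasks/*.yaml"
```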
required_skills
Type: list[string]
Default: []
Skills that must be available before the evaluation runs. The run will fail if any required skill is not found.
    config:
      required_skills:
        - code-analyzer
        - test-runner

disabled_skills
Type: list[string]
Default: []
Skills to disable for the evaluation. Use ["*"] to disable all skill loading entirely, or list specific skill directory names to exclude.
    # Disable all skills
    config:
      disabled_skills: ["*"]

    # Disable specific skills
    config:
      disabled_skills:
        - noisy-skill
        - experimental-skill

mcp_servers
Type: object
Default: (none)
MCP (Model Context Protocol) server configurations for this evaluation. Each key is a server name, and the value is its configuration.
    config:
      mcp_servers:
        filesystem:
          command: /path/to/server
          args: [--root, /data]
        github:
          url: http://localhost:3000

graders Section
List of validation rules. Used across tasks.

    graders:
      - type: text
        name: pattern_check
        config:
          regex_match: ["success"]

      - type: code
        name: logic_check
        config:
          assertions:
            - "len(output) > 0"

Grader Fields
| Field | Type | Description |
|---|---|---|
| type | string | Grader type: code, prompt, text, file, json_schema, program, behavior, action_sequence, skill_invocation, trigger, diff, tool_constraint, tool_calls |
| name | string | Unique grader name (used to reference in tasks) |
| weight | float | Relative importance in composite scoring (default: 1.0) |
| config | object | Type-specific configuration |
See Validators & Graders for complete documentation.
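None of the grader examples on this page set weight, so here is a minimal sketch of how it might be combined with the other fields. The weight value is illustrative; how weights enter the composite score is not specified here beyond "relative importance".

```yaml
graders:
  - type: text
    name: pattern_check
    weight: 2.0            # relative weight 2.0 versus the default 1.0
    config:
      regex_match: ["success"]

  - type: code
    name: logic_check      # weight omitted, so the default of 1.0 applies
    config:
      assertions:
        - "len(output) > 0"
```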
tasks Section
List of test cases to run.

    tasks:
      - "tasks/*.yaml"      # Load from files
      - "tasks/basic.yaml"  # Specific file

Loading from Files

    tasks:
      - "tasks/*.yaml"          # All YAML files
      - "tasks/basic/*.yaml"    # Subdirectory
      - "tasks/important.yaml"  # Single file

File paths are resolved relative to eval.yaml.
Task File Format
Individual task files (tasks/task-name.yaml).

    id: basic-usage-001       # Required: unique ID
    name: Basic Usage         # Required: display name
    description: "..."        # Optional: full description

    tags:                     # Optional: for filtering
      - basic
      - happy-path

    inputs:                   # Required: test inputs
      prompt: "Your instruction"
      files:
        - path: sample.py

    expected:                 # Required: validation rules
      output_contains: ["function"]
      behavior:
        max_tool_calls: 5

id
Type: string
Required: yes
Unique task identifier within the eval suite.
    id: basic-usage-001

name
Type: string
Required: yes
Human-readable task name.
    name: "Basic Usage - Python Function"

description
Type: string
Required: no
Full description of what the task tests.
    description: "Test that the agent explains a simple Python function correctly"

tags
Type: array of strings
Required: no
Tags for filtering and categorization.
    tags:
      - basic
      - happy-path
      - python

Usage:

    waza run eval.yaml --tags "basic"
    waza run eval.yaml --tags "edge-case"

skill_directories (optional)
Override the eval-level skill_directories for a specific task. Paths are resolved relative to the eval YAML directory.
    skill_directories:
      - ./skills/custom-task-skills

When specified on a task, this list replaces the eval-level skill_directories rather than merging with it.
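To make the replace semantics concrete, the two-document sketch below pairs an eval-level list with a task-level one. The task file name is hypothetical; the directory values are reused from the examples on this page.

```yaml
# eval.yaml (excerpt)
config:
  skill_directories:
    - ./custom-skills
    - /opt/shared-skills
---
# tasks/custom.yaml (excerpt, hypothetical file name)
skill_directories:
  - ./skills/custom-task-skills
# Additional directories searched for this task: only ./skills/custom-task-skills.
# ./custom-skills and /opt/shared-skills are not carried over from the eval level.
```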
instruction_files (optional)
Apply additional instruction files for a specific task. Paths are resolved relative to the active fixtures/context directory. Task-level entries are added to the eval-level config.instruction_files list.
    instruction_files:
      - .github/instructions/review.instructions.md

inputs Section
Test inputs passed to the agent.
    inputs:
      prompt: "Your instruction here"
      files:
        - path: sample.py
        - path: nested/module.py
        - content: |
            def hello():
                print("Hi")

prompt
Type: string
Required: yes (unless prompt_file is used)
Instruction text sent to the agent. Supports templating:
    inputs:
      prompt: |
        Explain this Python code:
        {{fixture:sample.py}}

prompt_file
Type: string
Required: no (alternative to prompt)
Path to a file containing the prompt text, resolved relative to the task YAML file’s directory.
Use this when prompts are long or shared across tasks. You must specify either prompt or prompt_file, but not both.
    inputs:
      prompt_file: prompts/review-instructions.md

follow_up_prompts
Type: array of strings
Required: no
Sequential follow-up prompts sent after the initial prompt. Each follow-up reuses the same session and workspace directory, preserving conversation history and file changes across turns.
    inputs:
      prompt: "Create a helper function"
      follow_up_prompts:
        - "Add error handling"
        - "Write tests"

Graders evaluate only the final state after all prompts complete. If any follow-up fails, remaining prompts are skipped and the run is marked as an error.
responder
Type: object
Required: no
An LLM-backed surrogate user that answers the skill’s follow-up questions during a multi-turn run. Mutually exclusive with follow_up_prompts.
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | no | Responder model. Defaults to the eval-level config.model. |
| instructions | string | yes | Target configuration the responder represents, plus its abstain rule. |
| max_followups | integer | yes | Maximum number of responder replies before the loop stops (>= 1). |

    inputs:
      prompt: "Add a new agent to my application"
      responder:
        instructions: "Be research-agent with tools web_search; abstain if unknown."
        max_followups: 8

The responder classifies each agent message as reply, stop, or abstain. An abstain marks the run as an error with outcome abstained, distinct from model timeouts or network errors. If max_followups is reached while the agent is still asking questions, the loop stops with outcome cap_exhausted and graders evaluate the final state.
files
Type: array
Required: no
Files to include in the task context.

    inputs:
      files:
        - path: sample.py            # Reference fixture
        - content: "def foo(): ..."  # Or inline content

repos
Type: array
Required: no
Local git repositories to materialize into the per-task workspace before the agent runs. Useful when developing skills inside the same repo the skills operate on: each test run gets a clean, isolated checkout instead of hand-staged fixtures.

    inputs:
      prompt: "Explain the repo layout"
      workdir: my-repo
      repos:
        - type: worktree                # required; only "worktree" is currently supported
          source: /path/to/local/clone  # required; local git repo to source from
          commit: main                  # optional; commit SHA, branch, or tag (defaults to HEAD)
          dest: my-repo                 # required; relative subdir under the workspace

| Field | Required | Description |
|---|---|---|
| type | yes | Materialization strategy. Currently only worktree (uses git worktree add --detach against a local source). |
| source | yes | Local filesystem path to a git repository. |
| commit | no | Commit SHA, branch, or tag. Defaults to source HEAD. Branch/tag names use --detach so they don’t conflict with the source checkout. |
| dest | yes | Relative subdirectory under the workspace where the repo is materialized. Required because git worktree add refuses targets that already exist. Must not contain .. segments. |
Worktrees are removed via git worktree remove --force when the engine shuts down, before the workspace directory is deleted.
workdir
Type: string
Required: no
Relative path inside the workspace to use as the agent’s working directory. Typically set to the same value as a repos[*].dest so the agent starts inside the checked-out repo. Must not contain path traversal.
    inputs:
      workdir: my-repo

expected Section
Validation rules and constraints.
    expected:
      output_contains: ["function", "parameter"]
      output_not_contains: ["error"]
      output_contains_any: ["recursion", "iteration", "loop"]
      outcomes:
        - type: task_completed
      behavior:
        max_tool_calls: 5
        max_tokens: 4096

output_contains
Type: array of strings
Strings that must appear in output.
    expected:
      output_contains:
        - "function"
        - "parameter"
        - "return"

output_not_contains
Type: array of strings
Strings that must NOT appear.
    expected:
      output_not_contains:
        - "error"
        - "failed"

output_contains_any
Type: array of strings
At least one of these strings must appear in the output (OR logic). Useful when an agent may express a concept in different ways. All checks are case-insensitive.
    expected:
      output_contains_any:
        - "recursion"
        - "iteration"
        - "loop"

matches
Type: array of strings
Regex patterns to match.
    expected:
      matches:
        - "returns\\s+.*value"
        - "def\\s+\\w+\\("

outcomes
Type: array of objects
Expected execution outcomes.
    expected:
      outcomes:
        - type: task_completed
        - type: tool_called
          tool_name: code_analyzer

behavior
Type: object
Behavioral constraints on agent execution. Each constraint that is set contributes to the behavior efficiency score.
| Field | Type | Description |
|---|---|---|
| max_tool_calls | int | Maximum tool invocations allowed |
| max_iterations | int | Maximum conversation rounds (turns) |
| max_tokens | int | Maximum tokens in the response |
| max_response_time_ms | int | Maximum wall-clock execution time in milliseconds |
| required_tools | string[] | Tools the agent must use |
| forbidden_tools | string[] | Tools the agent must NOT use |

    expected:
      behavior:
        max_tool_calls: 5
        max_iterations: 10
        max_tokens: 4096
        max_response_time_ms: 30000  # 30 seconds
        required_tools:
          - grep
        forbidden_tools:
          - rm

.waza.yaml Configuration
Optional project-level configuration file.
    # Root of waza project
    paths:
      skills: skills
      evals: evals
      results: results

    # Generated/discovered eval and task naming
    files:
      evalFile: eval.yaml
      taskGlob: tasks/*.yaml
      taskFileSuffix: .yaml

    # Model defaults
    defaults:
      model: claude-sonnet-4.6
      timeout: 300
      workers: 0

    # Cache settings
    cache:
      enabled: false
      dir: .waza-cache

    # Token budget configuration
    tokens:
      warningThreshold: 2500
      fallbackLimit: 1000
      limits:
        defaults:
          "SKILL.md": 500
          "references/**/*.md": 1000
          "*.md": 2000
        overrides:
          "README.md": 3000

    # Cloud storage (optional)
    storage:
      provider: azure-blob
      accountName: "myteamwaza"
      containerName: "waza-results"
      enabled: true

files Section
Controls the filenames generated by waza new skill and waza new eval. Workspace discovery prefers files.evalFile and falls back to eval.yaml for compatibility.
| Field | Type | Default | Description |
|---|---|---|---|
| evalFile | string | eval.yaml | Eval filename to generate and discover |
| taskGlob | string | tasks/*.yaml | Task glob written into generated eval specs |
| taskFileSuffix | string | .yaml | Suffix for generated task files |

For editor-friendly Waza-specific names:

    files:
      evalFile: waza-eval.yaml
      taskGlob: tasks/*.waza-task.yaml
      taskFileSuffix: .waza-task.yaml

tokens Section
Per-file token budget configuration used by waza tokens check and waza tokens suggest.
See the Token Limits guide for full documentation including pattern matching rules and migration from .token-limits.json.
Priority order: .waza.yaml tokens.limits → .token-limits.json (legacy, emits deprecation warning) → built-in defaults.
| Field | Type | Default | Description |
|---|---|---|---|
| warningThreshold | integer | 2500 | Token count at which a soft warning is shown |
| fallbackLimit | integer | 1000 | Limit applied to files that match no pattern |
| limits.defaults | map | (built-in) | Glob patterns → token limits |
| limits.overrides | map | {} | Exact file paths → token limits (take precedence over defaults) |
Both limits.defaults and limits.overrides are optional. When tokens.limits is present in .waza.yaml, .token-limits.json is not consulted.
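The interaction between overrides, defaults, and fallbackLimit can be read off a small fragment such as the one below. The values are reused from the .waza.yaml example; the comments reflect only the precedence rules stated in the table, not the detailed pattern matching rules from the Token Limits guide.

```yaml
tokens:
  fallbackLimit: 1000        # applies to files that match no pattern
  limits:
    defaults:
      "*.md": 2000           # glob pattern
    overrides:
      "README.md": 3000      # exact path; takes precedence over the "*.md" default
```

Under these settings README.md would be limited to 3000 tokens, other files matched by *.md to 2000, and unmatched files to 1000. Whether *.md also matches files in subdirectories depends on the matching rules described in the Token Limits guide.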
storage Section
Configuration for uploading evaluation results to cloud storage.
provider
Type: string
Values: azure-blob
The cloud provider to use for result storage.
    storage:
      provider: azure-blob

accountName
Type: string
Required: yes (when storage: is configured)
The Azure Storage account name. Results are uploaded to blob storage in this account.
    storage:
      accountName: "myteamwaza"

containerName
Type: string
Default: waza-results
The blob container where results are stored.
    storage:
      containerName: "waza-results"

enabled
Type: boolean
Default: true (when storage: is configured)
Enable or disable automatic result uploads. When false, results are only saved locally.
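For instance, to keep the storage settings in place while pausing uploads, a configuration along these lines might be used. The account and container names are the placeholder values from the .waza.yaml example.

```yaml
storage:
  provider: azure-blob
  accountName: "myteamwaza"
  containerName: "waza-results"
  enabled: false   # configuration is kept, but results are only saved locally
```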
    storage:
      enabled: true

Example Complete eval.yaml

    name: code-explainer-eval
    description: "Evaluation suite for code-explainer skill"
    skill: code-explainer
    version: "1.0"

    config:
      trials_per_task: 1
      timeout_seconds: 300
      parallel: false
      workers: 0
      model: claude-sonnet-4.6
      executor: copilot-sdk

    graders:
      - type: text
        name: explains_concepts
        config:
          regex_match:
            - "function"
            - "parameter"
            - "return"

      - type: code
        name: minimum_length
        config:
          assertions:
            - "len(output) > 200"

      - type: tool_calls
        name: reasonable_calls
        config:
          max_calls: 5

    tasks:
      - "tasks/*.yaml"

JSON Schema (programmatic access)
See schemas/config.schema.json in the repository for the complete JSON Schema.
    # Print the schema (jq exits with an error if the file is not well-formed JSON)
    jq . schemas/config.schema.json

Next Steps
- Writing Eval Specs: Full guide with examples
- CLI Reference: All commands
- GitHub Repository: Source code