# YAML Schema

Complete reference for the YAML schema used in waza evaluations.
## eval.yaml

Main evaluation configuration file.

```yaml
name: code-explainer-eval   # Required: eval suite name
description: "..."          # Required: what this eval tests
skill: code-explainer       # Required: skill name
version: "1.0"              # Optional: version number

config:
  trials_per_task: 1        # Runs per task
  timeout_seconds: 300      # Task timeout
  parallel: false           # Concurrent execution
  workers: 4                # Parallel workers
  model: claude-sonnet-4.6  # Default model
  executor: mock            # mock or copilot-sdk

graders:                    # Validation rules
  - type: text
    name: checks_logic
    config:
      regex_match:
        - "(?i)(function)"

tasks:                      # Test cases
  - "tasks/*.yaml"          # From files
```

## Top-Level Fields
### name

Type: string
Required: yes

The unique name of this evaluation suite. Used in reports and result files.

```yaml
name: code-explainer-eval
```

### description
Type: string
Required: yes

Describes what this evaluation tests. Appears in reports.

```yaml
description: "Evaluates agent's ability to explain Python code"
```

### skill

Type: string
Required: yes

The skill being evaluated.

```yaml
skill: code-explainer
```

### version
Type: string
Required: no
Default: (empty)

Version number of the evaluation suite.

```yaml
version: "1.0"
```

## config Section

### trials_per_task
Type: integer
Default: 1

How many times each task is run. Use a value greater than 1 for statistical confidence.

```yaml
config:
  trials_per_task: 3
```
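With multiple trials, per-task results are typically summarized as a pass rate. A minimal sketch of that aggregation, assuming each trial reduces to a pass/fail boolean (the `(task_id, passed)` tuple shape is illustrative, not waza's actual result format):

```python
from collections import defaultdict

def pass_rates(trial_results):
    """Aggregate per-trial pass/fail booleans into a per-task pass rate.

    trial_results: list of (task_id, passed) tuples, one per trial.
    """
    counts = defaultdict(lambda: [0, 0])  # task_id -> [passes, trials]
    for task_id, passed in trial_results:
        counts[task_id][0] += int(passed)
        counts[task_id][1] += 1
    return {task: p / n for task, (p, n) in counts.items()}

# Three trials of one task: 2 passes, 1 failure
rates = pass_rates([("t1", True), ("t1", True), ("t1", False)])
```

Running more trials narrows the confidence interval around each task's rate, which is why `trials_per_task > 1` is recommended for non-deterministic executors.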
### timeout_seconds

Type: integer
Default: 300

Maximum seconds per task before timeout.

```yaml
config:
  timeout_seconds: 300   # 5 minutes
```

Common values:

- 60 — Quick validation tasks
- 300 — Standard code analysis
- 600 — Complex multi-file tasks

### parallel

Type: boolean
Default: false

Run tasks concurrently instead of sequentially.

```yaml
config:
  parallel: true
```

### workers
Type: integer
Default: 4

Number of concurrent workers when parallel: true.

```yaml
config:
  parallel: true
  workers: 8
```

### model

Type: string
Default: (empty)

Default LLM model. Override with the --model flag.

```yaml
config:
  model: claude-sonnet-4.6
```

### executor
Type: string
Default: mock
Options: mock, copilot-sdk

Execution engine:

- mock — Local testing (no API calls)
- copilot-sdk — Real LLM execution

```yaml
config:
  executor: copilot-sdk
```

### max_attempts
Type: integer
Default: 0

Maximum retry attempts per task on failure. Set to 0 for no retries.

```yaml
config:
  max_attempts: 3
```

### group_by
Type: string
Default: (none)

Group results by a field in the output (e.g., tags, task_id). Useful for organizing results when running many tasks.

```yaml
config:
  group_by: tags
```

### fail_fast
Type: boolean
Default: false

Stop the entire evaluation run on the first task failure instead of continuing.

```yaml
config:
  fail_fast: true
```

### skill_directories
Type: list[string]
Default: []

Additional directories to search for skills beyond the default skills/ directory.

```yaml
config:
  skill_directories:
    - ./custom-skills
    - /opt/shared-skills
```

### required_skills
Type: list[string]
Default: []

Skills that must be available before the evaluation runs. The run will fail if any required skill is not found.

```yaml
config:
  required_skills:
    - code-analyzer
    - test-runner
```

### disabled_skills
Type: list[string]
Default: []

Skills to disable for the evaluation. Use ["*"] to disable all skill loading entirely, or list specific skill directory names to exclude.

```yaml
# Disable all skills
config:
  disabled_skills: ["*"]

# Disable specific skills
config:
  disabled_skills:
    - noisy-skill
    - experimental-skill
```

### mcp_servers
Type: object
Default: (none)

MCP (Model Context Protocol) server configurations for this evaluation. Each key is a server name, and the value is its configuration.

```yaml
config:
  mcp_servers:
    filesystem:
      command: /path/to/server
      args: [--root, /data]
    github:
      url: http://localhost:3000
```

## graders Section
List of validation rules, shared across tasks.

```yaml
graders:
  - type: text
    name: pattern_check
    config:
      regex_match: ["success"]

  - type: code
    name: logic_check
    config:
      assertions:
        - "len(output) > 0"
```

### Grader Fields
| Field | Type | Description |
|---|---|---|
| type | string | Grader type: code, prompt, text, file, json_schema, program, behavior, action_sequence, skill_invocation, trigger, diff, tool_constraint, tool_calls |
| name | string | Unique grader name (used to reference in tasks) |
| weight | float | Relative importance in composite scoring (default: 1.0) |
| config | object | Type-specific configuration |
See Validators & Graders for complete documentation.
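The weight field suggests that composite scoring is a weighted average of individual grader scores. A sketch of that combination under that assumption (the grader-result dicts below are illustrative, not waza's internal representation):

```python
def composite_score(grader_results):
    """Combine per-grader scores (0.0 to 1.0) into one weighted score.

    grader_results: list of dicts with "score" and optional "weight"
    (defaulting to 1.0, matching the table above).
    """
    total_weight = sum(g.get("weight", 1.0) for g in grader_results)
    if total_weight == 0:
        return 0.0
    weighted = sum(g["score"] * g.get("weight", 1.0) for g in grader_results)
    return weighted / total_weight

score = composite_score([
    {"score": 1.0, "weight": 2.0},  # pattern_check passed, weighted higher
    {"score": 0.5},                 # logic_check half-passed, default weight 1.0
])
# (1.0*2.0 + 0.5*1.0) / 3.0 ≈ 0.833
```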
## tasks Section

List of test cases to run.

```yaml
tasks:
  - "tasks/*.yaml"        # Load from files
  - "tasks/basic.yaml"    # Specific file
```

### Loading from Files

```yaml
tasks:
  - "tasks/*.yaml"          # All YAML files
  - "tasks/basic/*.yaml"    # Subdirectory
  - "tasks/important.yaml"  # Single file
```

File paths are relative to eval.yaml.
## Task File Format

Individual task files (tasks/task-name.yaml).

```yaml
id: basic-usage-001    # Required: unique ID
name: Basic Usage      # Required: display name
description: "..."     # Optional: full description

tags:                  # Optional: for filtering
  - basic
  - happy-path

inputs:                # Required: test inputs
  prompt: "Your instruction"
  files:
    - path: sample.py

expected:              # Required: validation rules
  output_contains: ["function"]
  behavior:
    max_tool_calls: 5
```

### id

Type: string
Required: yes
Unique task identifier within the eval suite.
```yaml
id: basic-usage-001
```

### name

Type: string
Required: yes

Human-readable task name.

```yaml
name: "Basic Usage - Python Function"
```

### description
Type: string
Required: no

Full description of what the task tests.

```yaml
description: "Test that the agent explains a simple Python function correctly"
```

### tags

Type: array of strings
Required: no
Tags for filtering and categorization.
```yaml
tags:
  - basic
  - happy-path
  - python
```

Usage:

```sh
waza run eval.yaml --tags "basic"
waza run eval.yaml --tags "edge-case"
```

### skill_directories (optional)
Override the eval-level skill_directories for a specific task. Paths are resolved relative to the eval YAML directory.

```yaml
skill_directories:
  - ./skills/custom-task-skills
```

When specified on a task, this replaces (not merges with) the eval-level skill_directories.
## inputs Section

Test inputs passed to the agent.

```yaml
inputs:
  prompt: "Your instruction here"
  files:
    - path: sample.py
    - path: nested/module.py
    - content: |
        def hello():
            print("Hi")
```

### prompt
Type: string
Required: yes (unless prompt_file is used)

Instruction text sent to the agent. Supports templating:

```yaml
inputs:
  prompt: |
    Explain this Python code:
    {{fixture:sample.py}}
```
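The {{fixture:...}} placeholder implies a substitution pass before the prompt is sent. A rough sketch of that expansion, assuming each placeholder maps to the contents of a fixture file (waza's actual resolution rules are not shown here):

```python
import re

def expand_fixtures(prompt: str, fixtures: dict) -> str:
    """Replace {{fixture:NAME}} placeholders with fixture contents."""
    def substitute(match):
        return fixtures[match.group(1)]
    return re.sub(r"\{\{fixture:([^}]+)\}\}", substitute, prompt)

prompt = "Explain this Python code:\n{{fixture:sample.py}}"
expanded = expand_fixtures(
    prompt, {"sample.py": "def add(a, b):\n    return a + b"}
)
```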
### prompt_file

Type: string
Required: no (alternative to prompt)

Path to a file containing the prompt text, resolved relative to the task YAML file's directory.

Use this when prompts are long or shared across tasks. You must specify either prompt or prompt_file, but not both.

```yaml
inputs:
  prompt_file: prompts/review-instructions.md
```

### follow_up_prompts
Type: array of strings
Required: no

Sequential follow-up prompts sent after the initial prompt. Each follow-up reuses the same session and workspace directory, preserving conversation history and file changes across turns.

```yaml
inputs:
  prompt: "Create a helper function"
  follow_up_prompts:
    - "Add error handling"
    - "Write tests"
```

Graders evaluate only the final state after all prompts complete. If any follow-up fails, remaining prompts are skipped and the run is marked as an error.
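The multi-turn semantics above can be sketched as a loop over one shared session, short-circuiting on the first error (the session object and its send method are hypothetical stand-ins for a real agent session):

```python
def run_turns(session, prompt, follow_up_prompts):
    """Send the initial prompt, then each follow-up in the same session.

    Returns (final_output, status) where status is "ok" or "error".
    """
    output = None
    for turn in [prompt, *follow_up_prompts]:
        try:
            output = session.send(turn)  # same history and workspace each turn
        except Exception:
            return output, "error"       # remaining follow-ups are skipped
    return output, "ok"                  # graders see only this final state

class EchoSession:
    """Toy session that records turns; stands in for a real agent session."""
    def __init__(self):
        self.history = []
    def send(self, text):
        self.history.append(text)
        return f"done: {text}"

session = EchoSession()
final, status = run_turns(session, "Create a helper function",
                          ["Add error handling", "Write tests"])
```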
### files

Type: array
Required: no

Files to include in the task context.

```yaml
inputs:
  files:
    - path: sample.py           # Reference fixture
    - content: "def foo(): ..."  # Or inline content
```

## expected Section
Validation rules and constraints.

```yaml
expected:
  output_contains: ["function", "parameter"]
  output_not_contains: ["error"]
  output_contains_any: ["recursion", "iteration", "loop"]
  outcomes:
    - type: task_completed
  behavior:
    max_tool_calls: 5
    max_tokens: 4096
```

### output_contains
Type: array of strings

Strings that must appear in output.

```yaml
expected:
  output_contains:
    - "function"
    - "parameter"
    - "return"
```

### output_not_contains
Type: array of strings

Strings that must NOT appear.

```yaml
expected:
  output_not_contains:
    - "error"
    - "failed"
```

### output_contains_any
Type: array of strings

At least one of these strings must appear in the output (OR logic). Useful when an agent may express a concept in different ways. All checks are case-insensitive.

```yaml
expected:
  output_contains_any:
    - "recursion"
    - "iteration"
    - "loop"
```
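Taken together, output_contains (AND), output_not_contains (NOT), and output_contains_any (OR) behave like the sketch below. The docs state only that the any-check is case-insensitive, so this sketch applies case-folding only there; treat the rest as an assumption, not waza's exact implementation:

```python
def check_output(output, contains=(), not_contains=(), contains_any=()):
    """Apply the three substring checks; True only if all of them pass."""
    if any(s not in output for s in contains):          # AND: every string present
        return False
    if any(s in output for s in not_contains):          # NOT: none present
        return False
    lowered = output.lower()                            # OR check is case-insensitive
    if contains_any and not any(s.lower() in lowered for s in contains_any):
        return False
    return True

ok = check_output(
    "This function uses Recursion on each parameter.",
    contains=["function", "parameter"],
    not_contains=["error"],
    contains_any=["recursion", "iteration", "loop"],
)
```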
### matches

Type: array of strings

Regex patterns to match.

```yaml
expected:
  matches:
    - "returns\\s+.*value"
    - "def\\s+\\w+\\("
```

### outcomes
Type: array of objects

Expected execution outcomes.

```yaml
expected:
  outcomes:
    - type: task_completed
    - type: tool_called
      tool_name: code_analyzer
```

### behavior
Type: object

Behavioral constraints on agent execution. Each constraint that is set contributes to the behavior efficiency score.

| Field | Type | Description |
|---|---|---|
| max_tool_calls | int | Maximum tool invocations allowed |
| max_iterations | int | Maximum conversation rounds (turns) |
| max_tokens | int | Maximum tokens in the response |
| max_response_time_ms | int | Maximum wall-clock execution time in milliseconds |
| required_tools | string[] | Tools the agent must use |
| forbidden_tools | string[] | Tools the agent must NOT use |

```yaml
expected:
  behavior:
    max_tool_calls: 5
    max_iterations: 10
    max_tokens: 4096
    max_response_time_ms: 30000   # 30 seconds
    required_tools:
      - grep
    forbidden_tools:
      - rm
```
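A sketch of how these constraints might be checked against an execution trace (the trace dict and its field names are hypothetical, chosen to mirror the table above):

```python
def check_behavior(trace, constraints):
    """Return the list of violated constraint names for one execution trace.

    trace: dict with "tool_calls" (list of tool names), "iterations",
    "tokens", and "response_time_ms".
    """
    violations = []
    calls = trace["tool_calls"]
    for key, observed in [("max_tool_calls", len(calls)),
                          ("max_iterations", trace["iterations"]),
                          ("max_tokens", trace["tokens"]),
                          ("max_response_time_ms", trace["response_time_ms"])]:
        if key in constraints and observed > constraints[key]:
            violations.append(key)
    for tool in constraints.get("required_tools", []):
        if tool not in calls:
            violations.append(f"required_tools:{tool}")
    for tool in constraints.get("forbidden_tools", []):
        if tool in calls:
            violations.append(f"forbidden_tools:{tool}")
    return violations

violations = check_behavior(
    {"tool_calls": ["grep", "rm"], "iterations": 3,
     "tokens": 1200, "response_time_ms": 900},
    {"max_tool_calls": 5, "required_tools": ["grep"], "forbidden_tools": ["rm"]},
)
# rm was called, so only the forbidden_tools constraint is violated
```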
## .waza.yaml Configuration

Optional project-level configuration file.

```yaml
# Root of waza project
paths:
  skills: skills
  evals: evals
  results: results

# Model defaults
defaults:
  model: claude-sonnet-4.6
  timeout: 300
  workers: 4

# Cache settings
cache:
  enabled: false
  dir: .waza-cache

# Token budget configuration
tokens:
  warningThreshold: 2500
  fallbackLimit: 1000
  limits:
    defaults:
      "SKILL.md": 500
      "references/**/*.md": 1000
      "*.md": 2000
    overrides:
      "README.md": 3000

# Cloud storage (optional)
storage:
  provider: azure-blob
  accountName: "myteamwaza"
  containerName: "waza-results"
  enabled: true
```

## tokens Section
Per-file token budget configuration used by `waza tokens check` and `waza tokens suggest`.
See the Token Limits guide for full documentation including pattern matching rules and migration from .token-limits.json.
Priority order: .waza.yaml tokens.limits → .token-limits.json (legacy, emits deprecation warning) → built-in defaults.
| Field | Type | Default | Description |
|---|---|---|---|
| warningThreshold | integer | 2500 | Token count at which a soft warning is shown |
| fallbackLimit | integer | 1000 | Limit applied to files that match no pattern |
| limits.defaults | map | (built-in) | Glob patterns → token limits |
| limits.overrides | map | {} | Exact file paths → token limits (take precedence over defaults) |

Both limits.defaults and limits.overrides are optional. When tokens.limits is present in .waza.yaml, .token-limits.json is not consulted.
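The resolution order can be sketched as a three-step lookup: exact override first, then the first matching defaults glob, then fallbackLimit. This is an illustration only; the real precedence rules among multiple matching globs are described in the Token Limits guide, and `fnmatch` glob semantics (where `*` can cross `/`) may differ from waza's:

```python
from fnmatch import fnmatch

def token_limit(path, limits, fallback=1000):
    """Resolve the token limit for a file path.

    limits: {"defaults": {glob: limit}, "overrides": {exact_path: limit}}
    fallback: the fallbackLimit applied when nothing matches.
    """
    overrides = limits.get("overrides", {})
    if path in overrides:                       # exact path wins
        return overrides[path]
    for pattern, limit in limits.get("defaults", {}).items():
        if fnmatch(path, pattern):              # first matching glob
            return limit
    return fallback                             # no pattern matched

limits = {
    "defaults": {"SKILL.md": 500, "references/**/*.md": 1000, "*.md": 2000},
    "overrides": {"README.md": 3000},
}
assert token_limit("README.md", limits) == 3000  # override beats "*.md"
assert token_limit("SKILL.md", limits) == 500    # defaults glob
assert token_limit("script.py", limits) == 1000  # falls back to fallbackLimit
```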
## storage Section

Configuration for uploading evaluation results to cloud storage.
### provider

Type: string
Values: azure-blob

The cloud provider to use for result storage.

```yaml
storage:
  provider: azure-blob
```

### accountName
Type: string
Required: yes (when storage: is configured)

The Azure Storage account name. Results are uploaded to blob storage in this account.

```yaml
storage:
  accountName: "myteamwaza"
```

### containerName
Type: string
Default: waza-results

The blob container where results are stored.

```yaml
storage:
  containerName: "waza-results"
```

### enabled
Type: boolean
Default: true (when storage: is configured)

Enable or disable automatic result uploads. When false, results are only saved locally.

```yaml
storage:
  enabled: true
```

## Example Complete eval.yaml
```yaml
name: code-explainer-eval
description: "Evaluation suite for code-explainer skill"
skill: code-explainer
version: "1.0"

config:
  trials_per_task: 1
  timeout_seconds: 300
  parallel: false
  workers: 4
  model: claude-sonnet-4.6
  executor: copilot-sdk

graders:
  - type: text
    name: explains_concepts
    config:
      regex_match:
        - "function"
        - "parameter"
        - "return"

  - type: code
    name: minimum_length
    config:
      assertions:
        - "len(output) > 200"

  - type: tool_calls
    name: reasonable_calls
    config:
      max_calls: 5

tasks:
  - "tasks/*.yaml"
```

## JSON Schema (programmatic access)
See schemas/config.schema.json in the repository for the complete JSON Schema.

```sh
# Validate config
jq . schemas/config.schema.json
```
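For programmatic validation you would normally feed the schema file to a JSON Schema validator (e.g. the Python jsonschema package). As a dependency-free illustration of the idea, even a minimal required-field check catches the most common mistake; the required-field list below is an assumption inferred from this page, not read from the real schema file:

```python
import json

# Assumed subset of what schemas/config.schema.json requires at the top level.
REQUIRED_TOP_LEVEL = ["name", "description", "skill"]

def check_required(config: dict):
    """Return the list of missing required top-level fields."""
    return [field for field in REQUIRED_TOP_LEVEL if field not in config]

config = json.loads('{"name": "code-explainer-eval", "skill": "code-explainer"}')
missing = check_required(config)
# "description" is required here but absent from the config
```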
## Next Steps

- Writing Eval Specs — Full guide with examples
- CLI Reference — All commands
- GitHub Repository — Source code