
YAML Schema

Complete reference for the YAML schema used in waza evaluations.

eval.yaml

Main evaluation configuration file.

name: code-explainer-eval     # Required: eval suite name
description: "..."            # Required: what this eval tests
skill: code-explainer         # Required: skill name
version: "1.0"                # Optional: version number
config:
  trials_per_task: 1          # Runs per task
  timeout_seconds: 300        # Task timeout
  parallel: false             # Concurrent execution
  workers: 4                  # Parallel workers
  model: claude-sonnet-4.6    # Default model
  executor: mock              # mock or copilot-sdk
graders:                      # Validation rules
  - type: text
    name: checks_logic
    config:
      regex_match:
        - "(?i)(function)"
tasks:                        # Test cases
  - "tasks/*.yaml"            # From files

name

Type: string
Required: yes

The unique name of this evaluation suite. Used in reports and result files.

name: code-explainer-eval

description

Type: string
Required: yes

Describes what this evaluation tests. Appears in reports.

description: "Evaluates agent's ability to explain Python code"

skill

Type: string
Required: yes

The skill being evaluated.

skill: code-explainer

version

Type: string
Required: no
Default: (empty)

Version number of the evaluation suite.

version: "1.0"

config.trials_per_task

Type: integer
Default: 1

How many times each task is run. Use a value greater than 1 for statistical confidence.

config:
  trials_per_task: 3

config.timeout_seconds

Type: integer
Default: 300

Maximum seconds per task before timeout.

config:
  timeout_seconds: 300 # 5 minutes

Common values:

  • 60 — Quick validation tasks
  • 300 — Standard code analysis
  • 600 — Complex multi-file tasks

config.parallel

Type: boolean
Default: false

Run tasks concurrently instead of sequentially.

config:
  parallel: true

config.workers

Type: integer
Default: 4

Number of concurrent workers when parallel: true.

config:
  parallel: true
  workers: 8

config.model

Type: string
Default: (empty)

Default LLM model. Override with the --model flag.

config:
  model: claude-sonnet-4.6

config.executor

Type: string
Default: mock
Options: mock, copilot-sdk

Execution engine:

  • mock — Local testing (no API calls)
  • copilot-sdk — Real LLM execution

config:
  executor: copilot-sdk

config.max_attempts

Type: integer
Default: 0

Maximum retry attempts per task on failure. Set to 0 for no retries.

config:
  max_attempts: 3

config.group_by

Type: string
Default: (none)

Group results by a field in the output (e.g., tags, task_id). Useful for organizing results when running many tasks.

config:
  group_by: tags

config.fail_fast

Type: boolean
Default: false

Stop the entire evaluation run on the first task failure instead of continuing.

config:
  fail_fast: true

config.skill_directories

Type: list[string]
Default: []

Additional directories to search for skills beyond the default skills/ directory.

config:
  skill_directories:
    - ./custom-skills
    - /opt/shared-skills

config.required_skills

Type: list[string]
Default: []

Skills that must be available before the evaluation runs. The run will fail if any required skill is not found.

config:
  required_skills:
    - code-analyzer
    - test-runner

config.disabled_skills

Type: list[string]
Default: []

Skills to disable for the evaluation. Use ["*"] to disable all skill loading entirely, or list specific skill directory names to exclude.

# Disable all skills
config:
  disabled_skills: ["*"]

# Disable specific skills
config:
  disabled_skills:
    - noisy-skill
    - experimental-skill

config.mcp_servers

Type: object
Default: (none)

MCP (Model Context Protocol) server configurations for this evaluation. Each key is a server name, and the value is its configuration.

config:
  mcp_servers:
    filesystem:
      command: /path/to/server
      args: [--root, /data]
    github:
      url: http://localhost:3000

graders

List of validation rules. Used across tasks.

graders:
  - type: text
    name: pattern_check
    config:
      regex_match: ["success"]
  - type: code
    name: logic_check
    config:
      assertions:
        - "len(output) > 0"

Fields:

  • type (string): Grader type: code, prompt, text, file, json_schema, program, behavior, action_sequence, skill_invocation, trigger, diff, tool_constraint, tool_calls
  • name (string): Unique grader name (used to reference in tasks)
  • weight (float): Relative importance in composite scoring (default: 1.0)
  • config (object): Type-specific configuration

See Validators & Graders for complete documentation.
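The weight field shapes composite scoring when several graders run together. A minimal sketch, assuming two graders where the text check should count twice as much as the code check (grader names and patterns here are illustrative, not from the source):

```yaml
graders:
  - type: text
    name: must_mention_function   # hypothetical grader name
    weight: 2.0                   # counts double relative to the default 1.0
    config:
      regex_match: ["(?i)function"]
  - type: code
    name: nonempty_output         # hypothetical grader name
    weight: 1.0
    config:
      assertions:
        - "len(output) > 0"
```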

tasks

List of test cases to run.

tasks:
  - "tasks/*.yaml"         # All YAML files
  - "tasks/basic/*.yaml"   # Subdirectory
  - "tasks/important.yaml" # Single file

File paths are relative to eval.yaml.

Task files

Individual task files (tasks/task-name.yaml).

id: basic-usage-001        # Required: unique ID
name: Basic Usage          # Required: display name
description: "..."         # Optional: full description
tags:                      # Optional: for filtering
  - basic
  - happy-path
inputs:                    # Required: test inputs
  prompt: "Your instruction"
  files:
    - path: sample.py
expected:                  # Required: validation rules
  output_contains: ["function"]
  behavior:
    max_tool_calls: 5

id

Type: string
Required: yes

Unique task identifier within the eval suite.

id: basic-usage-001

name

Type: string
Required: yes

Human-readable task name.

name: "Basic Usage - Python Function"

description

Type: string
Required: no

Full description of what the task tests.

description: "Test that the agent explains a simple Python function correctly"

tags

Type: array of strings
Required: no

Tags for filtering and categorization.

tags:
  - basic
  - happy-path
  - python

Usage:

waza run eval.yaml --tags "basic"
waza run eval.yaml --tags "edge-case"

skill_directories (task-level)

Override the eval-level skill_directories for a specific task. Paths are resolved relative to the eval YAML directory.

skill_directories:
  - ./skills/custom-task-skills

When specified on a task, this replaces (not merges with) the eval-level skill_directories.
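To illustrate the replace-not-merge behavior, a sketch with hypothetical paths: given an eval-level list and a task-level override, the task sees only its own directories.

```yaml
# eval.yaml (eval level) - illustrative path
config:
  skill_directories:
    - ./shared-skills

# tasks/special.yaml (task level) - replaces the eval-level list:
# this task searches only ./skills/custom-task-skills, not ./shared-skills
skill_directories:
  - ./skills/custom-task-skills
```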

inputs

Test inputs passed to the agent.

inputs:
  prompt: "Your instruction here"
  files:
    - path: sample.py
    - path: nested/module.py
    - content: |
        def hello():
          print("Hi")

inputs.prompt

Type: string
Required: yes (unless prompt_file is used)

Instruction text sent to the agent. Supports templating:

inputs:
  prompt: |
    Explain this Python code:
    {{fixture:sample.py}}

inputs.prompt_file

Type: string
Required: no (alternative to prompt)

Path to a file containing the prompt text, resolved relative to the task YAML file's directory. Use this when prompts are long or shared across tasks. You must specify either prompt or prompt_file, but not both.

inputs:
  prompt_file: prompts/review-instructions.md

inputs.follow_up_prompts

Type: array of strings
Required: no

Sequential follow-up prompts sent after the initial prompt. Each follow-up reuses the same session and workspace directory, preserving conversation history and file changes across turns.

inputs:
  prompt: "Create a helper function"
  follow_up_prompts:
    - "Add error handling"
    - "Write tests"

Graders evaluate only the final state after all prompts complete. If any follow-up fails, remaining prompts are skipped and the run is marked as an error.
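Putting those rules together, a sketch of a complete multi-turn task whose checks apply only to the final state after all three turns (the ID, name, and expectations are illustrative):

```yaml
id: multi-turn-001            # hypothetical task ID
name: Helper With Tests
inputs:
  prompt: "Create a helper function"
  follow_up_prompts:
    - "Add error handling"
    - "Write tests"
expected:
  # Evaluated against the final state, not intermediate turns
  output_contains: ["test"]
  behavior:
    max_tool_calls: 15        # budget across all turns (assumed semantics)
```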

inputs.files

Type: array
Required: no

Files to include in the task context.

inputs:
  files:
    - path: sample.py             # Reference fixture
    - content: "def foo(): ..."   # Or inline content

expected

Validation rules and constraints.

expected:
  output_contains: ["function", "parameter"]
  output_not_contains: ["error"]
  output_contains_any: ["recursion", "iteration", "loop"]
  outcomes:
    - type: task_completed
  behavior:
    max_tool_calls: 5
    max_tokens: 4096

expected.output_contains

Type: array of strings

Strings that must appear in the output.

expected:
  output_contains:
    - "function"
    - "parameter"
    - "return"

expected.output_not_contains

Type: array of strings

Strings that must NOT appear in the output.

expected:
  output_not_contains:
    - "error"
    - "failed"

expected.output_contains_any

Type: array of strings

At least one of these strings must appear in the output (OR logic). Useful when an agent may express a concept in different ways. All checks are case-insensitive.

expected:
  output_contains_any:
    - "recursion"
    - "iteration"
    - "loop"
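The two checks compose naturally: output_contains lists terms that must all appear (AND), while output_contains_any needs only one match (OR). A sketch combining them:

```yaml
expected:
  output_contains:        # AND: every term must appear
    - "function"
  output_contains_any:    # OR: at least one of these must appear
    - "recursion"
    - "iteration"
```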

expected.matches

Type: array of strings

Regex patterns to match.

expected:
  matches:
    - "returns\\s+.*value"
    - "def\\s+\\w+\\("

expected.outcomes

Type: array of objects

Expected execution outcomes.

expected:
  outcomes:
    - type: task_completed
    - type: tool_called
      tool_name: code_analyzer

expected.behavior

Type: object

Behavioral constraints on agent execution. Each constraint that is set contributes to the behavior efficiency score.

  • max_tool_calls (int): Maximum tool invocations allowed
  • max_iterations (int): Maximum conversation rounds (turns)
  • max_tokens (int): Maximum tokens in the response
  • max_response_time_ms (int): Maximum wall-clock execution time in milliseconds
  • required_tools (string[]): Tools the agent must use
  • forbidden_tools (string[]): Tools the agent must NOT use

expected:
  behavior:
    max_tool_calls: 5
    max_iterations: 10
    max_tokens: 4096
    max_response_time_ms: 30000 # 30 seconds
    required_tools:
      - grep
    forbidden_tools:
      - rm

.waza.yaml

Optional project-level configuration file.

# Root of waza project
paths:
  skills: skills
  evals: evals
  results: results

# Model defaults
defaults:
  model: claude-sonnet-4.6
  timeout: 300
  workers: 4

# Cache settings
cache:
  enabled: false
  dir: .waza-cache

# Token budget configuration
tokens:
  warningThreshold: 2500
  fallbackLimit: 1000
  limits:
    defaults:
      "SKILL.md": 500
      "references/**/*.md": 1000
      "*.md": 2000
    overrides:
      "README.md": 3000

# Cloud storage (optional)
storage:
  provider: azure-blob
  accountName: "myteamwaza"
  containerName: "waza-results"
  enabled: true

tokens

Per-file token budget configuration used by waza tokens check and waza tokens suggest. See the Token Limits guide for full documentation, including pattern matching rules and migration from .token-limits.json.

Priority order: .waza.yaml tokens.limits → .token-limits.json (legacy, emits deprecation warning) → built-in defaults.

  • warningThreshold (integer, default 2500): Token count at which a soft warning is shown
  • fallbackLimit (integer, default 1000): Limit applied to files that match no pattern
  • limits.defaults (map, built-in default): Glob patterns → token limits
  • limits.overrides (map, default {}): Exact file paths → token limits (take precedence over defaults)

Both limits.defaults and limits.overrides are optional. When tokens.limits is present in .waza.yaml, .token-limits.json is not consulted.
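As a sketch of the precedence rule: a file named README.md matches both the "*.md" glob default and an exact-path override, and the override wins, so README.md gets a 3000-token limit while other Markdown files get 2000.

```yaml
tokens:
  limits:
    defaults:
      "*.md": 2000        # matches any Markdown file
    overrides:
      "README.md": 3000   # exact path: takes precedence over the "*.md" default
```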

storage

Configuration for uploading evaluation results to cloud storage.

storage.provider

Type: string
Values: azure-blob

The cloud provider to use for result storage.

storage:
  provider: azure-blob

storage.accountName

Type: string
Required: yes (when storage: is configured)

The Azure Storage account name. Results are uploaded to blob storage in this account.

storage:
  accountName: "myteamwaza"

storage.containerName

Type: string
Default: waza-results

The blob container where results are stored.

storage:
  containerName: "waza-results"

storage.enabled

Type: boolean
Default: true (when storage: is configured)

Enable or disable automatic result uploads. When false, results are only saved locally.

storage:
  enabled: true
Complete example

name: code-explainer-eval
description: "Evaluation suite for code-explainer skill"
skill: code-explainer
version: "1.0"
config:
  trials_per_task: 1
  timeout_seconds: 300
  parallel: false
  workers: 4
  model: claude-sonnet-4.6
  executor: copilot-sdk
graders:
  - type: text
    name: explains_concepts
    config:
      regex_match:
        - "function"
        - "parameter"
        - "return"
  - type: code
    name: minimum_length
    config:
      assertions:
        - "len(output) > 200"
  - type: tool_calls
    name: reasonable_calls
    config:
      max_calls: 5
tasks:
  - "tasks/*.yaml"

See schemas/config.schema.json in the repository for the complete JSON Schema.

# Check that the schema file is well-formed JSON
jq . schemas/config.schema.json