YAML Schema

Complete reference for the YAML schema used in waza evaluations.

Main evaluation configuration file.

name: code-explainer-eval # Required: eval suite name
description: "..." # Required: what this eval tests
skill: code-explainer # Required: skill name
version: "1.0" # Optional: version number
config:
  trials_per_task: 1 # Runs per task
  timeout_seconds: 300 # Task timeout
  parallel: false # Concurrent execution
  workers: 4 # Parallel workers
  model: claude-sonnet-4.6 # Default model
  executor: mock # mock or copilot-sdk
graders: # Validation rules
  - type: text
    name: checks_logic
    config:
      regex_match:
        - "(?i)(function)"
tasks: # Test cases
  - "tasks/*.yaml" # From files

Type: string
Required: yes

The unique name of this evaluation suite. Used in reports and result files.

name: code-explainer-eval

Type: string
Required: yes

Describes what this evaluation tests. Appears in reports.

description: "Evaluates agent's ability to explain Python code"

Type: string
Required: yes

The skill being evaluated.

skill: code-explainer

Type: string
Required: no
Default: (empty)

Version number of the evaluation suite.

version: "1.0"

Type: integer
Default: 1

Number of times each task is run. Use a value greater than 1 for statistical confidence.

config:
  trials_per_task: 3

Type: integer
Default: 300

Maximum seconds per task before timeout.

config:
  timeout_seconds: 300 # 5 minutes

Common values:

  • 60 — Quick validation tasks
  • 300 — Standard code analysis
  • 600 — Complex multi-file tasks

Type: boolean
Default: false

Run tasks concurrently instead of sequentially.

config:
  parallel: true

Type: integer
Default: 4

Number of concurrent workers when parallel: true.

config:
  parallel: true
  workers: 8

Type: string
Default: (empty)

Default LLM model. Override it with the --model flag.

config:
  model: claude-sonnet-4.6

Type: string
Default: mock
Options: mock, copilot-sdk

Execution engine:

  • mock — Local testing (no API calls)
  • copilot-sdk — Real LLM execution

config:
  executor: copilot-sdk

Type: integer
Default: 0

Maximum retry attempts per task on failure. Set to 0 for no retries.

config:
  max_attempts: 3

Type: string
Default: (none)

Group results by a field in the output (e.g., tags, task_id). Useful for organizing results when running many tasks.

config:
  group_by: tags

Type: boolean
Default: false

Stop the entire evaluation run on the first task failure instead of continuing.

config:
  fail_fast: true

Type: list[string]
Default: []

Additional directories to search for skills beyond the default skills/ directory.

config:
  skill_directories:
    - ./custom-skills
    - /opt/shared-skills

Type: list[string]
Default: []

Instruction files to apply to every task. Paths are resolved relative to the active fixtures/context directory, copied into each task workspace, and appended to the agent system message as path-labeled instructions.

config:
  instruction_files:
    - .github/instructions/project.instructions.md

Type: list[string]
Default: []

Skills that must be available before the evaluation runs. The run will fail if any required skill is not found.

config:
  required_skills:
    - code-analyzer
    - test-runner

Type: list[string]
Default: []

Skills to disable for the evaluation. Use ["*"] to disable all skill loading entirely, or list specific skill directory names to exclude.

# Disable all skills
config:
  disabled_skills: ["*"]

# Disable specific skills
config:
  disabled_skills:
    - noisy-skill
    - experimental-skill

Type: object
Default: (none)

MCP (Model Context Protocol) server configurations for this evaluation. Each key is a server name, and the value is its configuration.

config:
  mcp_servers:
    filesystem:
      command: /path/to/server
      args: [--root, /data]
    github:
      url: http://localhost:3000

List of validation rules shared across tasks.

graders:
  - type: text
    name: pattern_check
    config:
      regex_match: ["success"]
  - type: code
    name: logic_check
    config:
      assertions:
        - "len(output) > 0"

| Field | Type | Description |
|-------|------|-------------|
| type | string | Grader type: code, prompt, text, file, json_schema, program, behavior, action_sequence, skill_invocation, trigger, diff, tool_constraint, tool_calls |
| name | string | Unique grader name (used to reference in tasks) |
| weight | float | Relative importance in composite scoring (default: 1.0) |
| config | object | Type-specific configuration |

See Validators & Graders for complete documentation.

List of test cases to run.

tasks:
  - "tasks/*.yaml" # Load from files
  - "tasks/basic.yaml" # Specific file

tasks:
  - "tasks/*.yaml" # All YAML files
  - "tasks/basic/*.yaml" # Subdirectory
  - "tasks/important.yaml" # Single file

File paths are relative to the location of eval.yaml.
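To make the path rule concrete, here is a minimal Python sketch of expanding `tasks` entries relative to the eval file. It is an illustration only, not waza's implementation: the helper name `resolve_task_files` is hypothetical, and the ordering and de-duplication behavior are assumptions.

```python
from pathlib import Path


def resolve_task_files(eval_file, patterns):
    """Expand task patterns relative to the directory that holds eval_file."""
    base = Path(eval_file).resolve().parent
    found = []
    for pattern in patterns:
        # Path.glob also accepts a pattern without wildcards, so a single
        # file entry such as "tasks/important.yaml" goes through the same path.
        for match in sorted(base.glob(pattern)):
            if match.is_file() and match not in found:
                found.append(match)  # keep first occurrence, skip duplicates
    return found
```

With this sketch, an entry that repeats a file already matched by an earlier glob contributes it only once.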

Individual task files (tasks/task-name.yaml).

id: basic-usage-001 # Required: unique ID
name: Basic Usage # Required: display name
description: "..." # Optional: full description
tags: # Optional: for filtering
  - basic
  - happy-path
inputs: # Required: test inputs
  prompt: "Your instruction"
  files:
    - path: sample.py
expected: # Required: validation rules
  output_contains: ["function"]
  behavior:
    max_tool_calls: 5

Type: string
Required: yes

Unique task identifier within the eval suite.

id: basic-usage-001

Type: string
Required: yes

Human-readable task name.

name: "Basic Usage - Python Function"

Type: string
Required: no

Full description of what the task tests.

description: "Test that the agent explains a simple Python function correctly"

Type: array of strings
Required: no

Tags for filtering and categorization.

tags:
  - basic
  - happy-path
  - python

Usage:

waza run eval.yaml --tags "basic"
waza run eval.yaml --tags "edge-case"

Override the eval-level skill_directories for a specific task. Paths are resolved relative to the eval YAML directory.

skill_directories:
  - ./skills/custom-task-skills

When specified on a task, this replaces (not merges with) the eval-level skill_directories.

Apply additional instruction files for a specific task. Paths are resolved relative to the active fixtures/context directory. Task-level entries are added to the eval-level config.instruction_files list.

instruction_files:
  - .github/instructions/review.instructions.md
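The two task-level fields combine with the eval-level config differently: skill_directories replaces, while instruction_files adds. The Python sketch below illustrates that difference on already-parsed dictionaries. The helper name `effective_task_settings` is hypothetical, and placing eval-level instruction files before task-level ones is an assumption, not something this reference specifies.

```python
def effective_task_settings(eval_config, task):
    """Combine eval-level config with task-level settings (hypothetical helper)."""
    if "skill_directories" in task:
        # A task-level list replaces the eval-level list entirely.
        skill_dirs = list(task["skill_directories"])
    else:
        skill_dirs = list(eval_config.get("skill_directories", []))
    # Task-level instruction files are added to the eval-level ones.
    # Assumption: eval-level entries come first.
    instructions = list(eval_config.get("instruction_files", [])) + list(
        task.get("instruction_files", [])
    )
    return {"skill_directories": skill_dirs, "instruction_files": instructions}
```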

Test inputs passed to the agent.

inputs:
  prompt: "Your instruction here"
  files:
    - path: sample.py
    - path: nested/module.py
    - content: |
        def hello():
            print("Hi")

Type: string
Required: yes (unless prompt_file is used)

Instruction text sent to the agent. Supports templating:

inputs:
  prompt: |
    Explain this Python code:
    {{fixture:sample.py}}

Type: string
Required: no (alternative to prompt)

Path to a file containing the prompt text, resolved relative to the task YAML file’s directory. Use this when prompts are long or shared across tasks. You must specify either prompt or prompt_file, but not both.

inputs:
  prompt_file: prompts/review-instructions.md

Type: array of strings
Required: no

Sequential follow-up prompts sent after the initial prompt. Each follow-up reuses the same session and workspace directory, preserving conversation history and file changes across turns.

inputs:
  prompt: "Create a helper function"
  follow_up_prompts:
    - "Add error handling"
    - "Write tests"

Graders evaluate only the final state after all prompts complete. If any follow-up fails, remaining prompts are skipped and the run is marked as an error.
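The control flow described above can be pictured with a short Python sketch. This is a rough model under stated assumptions, not waza's code: `run_prompts` and the `send` callable are hypothetical, and the sketch assumes a failed turn surfaces as an exception.

```python
def run_prompts(send, prompt, follow_up_prompts=()):
    """Send the initial prompt and each follow-up through one session.

    `send` is a hypothetical callable that delivers one prompt to the agent
    session and returns its response, raising an exception on failure.
    """
    responses = []
    for text in [prompt, *follow_up_prompts]:
        try:
            responses.append(send(text))
        except Exception as exc:
            # Remaining prompts are skipped and the run is reported as an error.
            return {"status": "error", "turns": len(responses), "error": str(exc)}
    # Only the state after the last prompt is handed to the graders.
    return {"status": "completed", "turns": len(responses), "final": responses[-1]}
```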

Type: array
Required: no

Files to include in the task context.

inputs:
  files:
    - path: sample.py # Reference fixture
    - content: "def foo(): ..." # Or inline content

Validation rules and constraints.

expected:
  output_contains: ["function", "parameter"]
  output_not_contains: ["error"]
  output_contains_any: ["recursion", "iteration", "loop"]
  outcomes:
    - type: task_completed
  behavior:
    max_tool_calls: 5
    max_tokens: 4096

Type: array of strings

Strings that must appear in output.

expected:
  output_contains:
    - "function"
    - "parameter"
    - "return"

Type: array of strings

Strings that must NOT appear in the output.

expected:
  output_not_contains:
    - "error"
    - "failed"

Type: array of strings

At least one of these strings must appear in the output (OR logic). Useful when an agent may express a concept in different ways. All checks are case-insensitive.

expected:
  output_contains_any:
    - "recursion"
    - "iteration"
    - "loop"
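The three string checks differ only in their logic: all of (output_contains), none of (output_not_contains), and at least one of (output_contains_any). A minimal Python sketch of that logic follows. The function `check_output` is hypothetical, and applying case-insensitive comparison to all three checks is an assumption drawn from the note above; waza's actual matching may differ.

```python
def check_output(output, contains=(), not_contains=(), contains_any=()):
    """Return a list of failure messages; an empty list means all checks pass."""
    text = output.lower()  # assumption: every check ignores case
    failures = []
    for needle in contains:  # AND logic: every string must appear
        if needle.lower() not in text:
            failures.append(f"missing: {needle}")
    for needle in not_contains:  # none of these may appear
        if needle.lower() in text:
            failures.append(f"forbidden: {needle}")
    # OR logic: one match out of the list is enough
    if contains_any and not any(n.lower() in text for n in contains_any):
        failures.append("none of: " + ", ".join(contains_any))
    return failures
```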

Type: array of strings

Regex patterns that must match the output.

expected:
  matches:
    - "returns\\s+.*value"
    - "def\\s+\\w+\\("
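Inside a double-quoted YAML string, `\\` decodes to a single backslash, so the first pattern above reaches the regex engine as `returns\s+.*value`. The Python sketch below shows one way such patterns could be applied. The helper `unmatched_patterns` is hypothetical, and searching anywhere in the output (rather than requiring a full match) is an assumption.

```python
import re


def unmatched_patterns(output, patterns):
    """Return the patterns that do not match anywhere in output."""
    # Assumption: a pattern passes if it matches any substring of the output.
    return [p for p in patterns if re.search(p, output) is None]
```

In Python, raw strings such as `r"def\s+\w+\("` express the same patterns without doubling the backslashes.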

Type: array of objects

Expected execution outcomes.

expected:
  outcomes:
    - type: task_completed
    - type: tool_called
      tool_name: code_analyzer

Type: object

Behavioral constraints on agent execution. Each constraint that is set contributes to the behavior efficiency score.

| Field | Type | Description |
|-------|------|-------------|
| max_tool_calls | int | Maximum tool invocations allowed |
| max_iterations | int | Maximum conversation rounds (turns) |
| max_tokens | int | Maximum tokens in the response |
| max_response_time_ms | int | Maximum wall-clock execution time in milliseconds |
| required_tools | string[] | Tools the agent must use |
| forbidden_tools | string[] | Tools the agent must NOT use |

expected:
  behavior:
    max_tool_calls: 5
    max_iterations: 10
    max_tokens: 4096
    max_response_time_ms: 30000 # 30 seconds
    required_tools:
      - grep
    forbidden_tools:
      - rm
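As a rough illustration of how these constraints relate to a finished run, the Python sketch below checks a run record against a `behavior` block. It is not waza's scoring code and does not compute the behavior efficiency score. The function `behavior_violations` and the shape of the `run` record (tools_used, iterations, tokens, elapsed_ms) are assumptions made for the example.

```python
def behavior_violations(behavior, run):
    """Return a list of constraint violations for one run.

    `run` is a hypothetical record with the keys tools_used (list of tool
    names, one entry per call), iterations, tokens, and elapsed_ms.
    """
    observed = {
        "max_tool_calls": len(run["tools_used"]),
        "max_iterations": run["iterations"],
        "max_tokens": run["tokens"],
        "max_response_time_ms": run["elapsed_ms"],
    }
    violations = []
    for key, actual in observed.items():
        limit = behavior.get(key)
        if limit is not None and actual > limit:  # unset constraints are skipped
            violations.append(f"{key}: {actual} > {limit}")
    used = set(run["tools_used"])
    for tool in behavior.get("required_tools", []):
        if tool not in used:
            violations.append(f"required tool not used: {tool}")
    for tool in behavior.get("forbidden_tools", []):
        if tool in used:
            violations.append(f"forbidden tool used: {tool}")
    return violations
```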

Optional project-level configuration file (.waza.yaml).

# Root of waza project
paths:
  skills: skills
  evals: evals
  results: results
# Generated/discovered eval and task naming
files:
  evalFile: eval.yaml
  taskGlob: tasks/*.yaml
  taskFileSuffix: .yaml
# Model defaults
defaults:
  model: claude-sonnet-4.6
  timeout: 300
  workers: 4
# Cache settings
cache:
  enabled: false
  dir: .waza-cache
# Token budget configuration
tokens:
  warningThreshold: 2500
  fallbackLimit: 1000
  limits:
    defaults:
      "SKILL.md": 500
      "references/**/*.md": 1000
      "*.md": 2000
    overrides:
      "README.md": 3000
# Cloud storage (optional)
storage:
  provider: azure-blob
  accountName: "myteamwaza"
  containerName: "waza-results"
  enabled: true

Controls the filenames generated by waza new skill and waza new eval. Workspace discovery prefers files.evalFile and falls back to eval.yaml for compatibility.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| evalFile | string | eval.yaml | Eval filename to generate and discover |
| taskGlob | string | tasks/*.yaml | Task glob written into generated eval specs |
| taskFileSuffix | string | .yaml | Suffix for generated task files |

For editor-friendly Waza-specific names:

files:
  evalFile: waza-eval.yaml
  taskGlob: tasks/*.waza-task.yaml
  taskFileSuffix: .waza-task.yaml

Per-file token budget configuration used by waza tokens check and waza tokens suggest. See the Token Limits guide for full documentation including pattern matching rules and migration from .token-limits.json.

Priority order: .waza.yaml tokens.limits → .token-limits.json (legacy, emits deprecation warning) → built-in defaults.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| warningThreshold | integer | 2500 | Token count at which a soft warning is shown |
| fallbackLimit | integer | 1000 | Limit applied to files that match no pattern |
| limits.defaults | map | (built-in) | Glob patterns → token limits |
| limits.overrides | map | {} | Exact file paths → token limits (take precedence over defaults) |

Both limits.defaults and limits.overrides are optional. When tokens.limits is present in .waza.yaml, .token-limits.json is not consulted.
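The lookup order within the `tokens` section (exact override, then glob pattern, then fallbackLimit) can be sketched in Python as below. This is an approximation under stated assumptions: `token_limit_for` is a hypothetical helper, glob matching uses Python's `fnmatch` rather than waza's own pattern rules (see the Token Limits guide for those), the first matching pattern is assumed to win, and the built-in defaults that apply when limits.defaults is absent are not modeled.

```python
from fnmatch import fnmatchcase


def token_limit_for(path, tokens_config):
    """Pick the token limit for path from a parsed `tokens` section."""
    limits = tokens_config.get("limits", {})
    overrides = limits.get("overrides", {})
    if path in overrides:  # exact paths take precedence over patterns
        return overrides[path]
    for pattern, limit in limits.get("defaults", {}).items():
        # Assumption: the first matching pattern in declaration order wins.
        if fnmatchcase(path, pattern):
            return limit
    return tokens_config.get("fallbackLimit", 1000)  # no pattern matched
```

With the sample configuration shown earlier, README.md resolves through its override, while a file that matches no pattern gets the fallback limit.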

Configuration for uploading evaluation results to cloud storage.

Type: string
Values: azure-blob

The cloud provider to use for result storage.

storage:
  provider: azure-blob

Type: string
Required: yes (when storage: is configured)

The Azure Storage account name. Results are uploaded to blob storage in this account.

storage:
  accountName: "myteamwaza"

Type: string
Default: waza-results

The blob container where results are stored.

storage:
  containerName: "waza-results"

Type: boolean
Default: true (when storage: is configured)

Enable or disable automatic result uploads. When false, results are only saved locally.

storage:
  enabled: true

Complete eval configuration example:

name: code-explainer-eval
description: "Evaluation suite for code-explainer skill"
skill: code-explainer
version: "1.0"
config:
  trials_per_task: 1
  timeout_seconds: 300
  parallel: false
  workers: 4
  model: claude-sonnet-4.6
  executor: copilot-sdk
graders:
  - type: text
    name: explains_concepts
    config:
      regex_match:
        - "function"
        - "parameter"
        - "return"
  - type: code
    name: minimum_length
    config:
      assertions:
        - "len(output) > 200"
  - type: tool_calls
    name: reasonable_calls
    config:
      max_calls: 5
tasks:
  - "tasks/*.yaml"

See schemas/config.schema.json in the repository for complete JSON Schema.

# Pretty-print the schema file to confirm it is well-formed JSON
jq . schemas/config.schema.json