YAML Schema

Complete reference for the YAML schema used in waza evaluations.

Main evaluation configuration file.

name: code-explainer-eval # Required: eval suite name
description: "..." # Required: what this eval tests
skill: code-explainer # Required: skill name
version: "1.0" # Optional: version number
config:
  trials_per_task: 1 # Runs per task
  timeout_seconds: 300 # Task timeout
  parallel: false # Concurrent execution
  workers: 0 # Auto-size parallel workers
  model: claude-sonnet-4.6 # Default model
  executor: mock # mock or copilot-sdk
graders: # Validation rules
  - type: text
    name: checks_logic
    config:
      regex_match:
        - "(?i)(function)"
tasks: # Test cases
  - "tasks/*.yaml" # From files

Type: string
Required: yes

The unique name of this evaluation suite. Used in reports and result files.

name: code-explainer-eval

Type: string
Required: yes

Describes what this evaluation tests. Appears in reports.

description: "Evaluates agent's ability to explain Python code"

Type: string
Required: yes

The skill being evaluated.

skill: code-explainer

Type: string
Required: no
Default: (empty)

Version number of the evaluation suite.

version: "1.0"

Type: integer
Default: 1

How many times each task is run. Use > 1 for statistical confidence.

config:
  trials_per_task: 3

Type: integer
Default: 300

Maximum seconds per task before timeout.

config:
  timeout_seconds: 300 # 5 minutes

Common values:

  • 60 — Quick validation tasks
  • 300 — Standard code analysis
  • 600 — Complex multi-file tasks

Type: integer
Default: 0 (disabled)

Maximum seconds to wait for the first session event before treating a run as a session-start hang and aborting it. The embedded engine can occasionally launch but never start the agent’s first turn (no events at all); without this guard such a run blocks until timeout_seconds, which must be sized for the slowest legitimate full turn and is therefore a poor catch-all for a fast failure. Keep this comfortably above realistic first-turn latency so a slow-but-live first turn is never aborted. 0 disables the check. Override per task with the task-level field of the same name (0 disables it for that task).

config:
  timeout_seconds: 1800 # 30 min — the slowest legitimate full turn
  first_event_timeout_seconds: 300 # but the first turn must start within 5 min
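As a sketch of the per-task override described above, a task that is known to start slowly could switch the first-event check off for itself while the eval-level guard stays active for every other task. The file name, task ID, name, and prompt are hypothetical, and placing the field at the top level of the task file is an assumption; check the task schema for the exact location.

```yaml
# tasks/large-repo-scan.yaml (hypothetical task file)
id: large-repo-scan-001
name: Large Repo Scan
# Assumption: the task-level field sits at the top level of the task file.
first_event_timeout_seconds: 0 # disables the first-event check for this task only
inputs:
  prompt: "Summarize the repository layout"
```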

Type: boolean
Default: false

Run tasks concurrently instead of sequentially.

config:
  parallel: true

Type: integer
Default: 0

Number of concurrent workers when parallel: true. Use 0 or omit the field to auto-size workers.

config:
  parallel: true
  workers: 8

Type: string
Default: (empty)

Default LLM model. Override with --model flag.

config:
  model: claude-sonnet-4.6

Type: string
Default: mock
Options: mock, copilot-sdk

Execution engine:

  • mock — Local testing (no API calls)
  • copilot-sdk — Real LLM execution

config:
  executor: copilot-sdk

Type: integer
Default: 0

Maximum retry attempts per task on failure. Set to 0 for no retries.

config:
  max_attempts: 3

Type: string
Default: (none)

Group results by a field in the output (e.g., tags, task_id). Useful for organizing results when running many tasks.

config:
  group_by: tags

Type: boolean
Default: false

Stop the entire evaluation run on the first task failure instead of continuing.

config:
  fail_fast: true

Type: list[string]
Default: []

Additional directories to search for skills beyond the default skills/ directory.

config:
  skill_directories:
    - ./custom-skills
    - /opt/shared-skills

Type: list[string]
Default: []

Instruction files to apply to every task. Paths are resolved relative to the active fixtures/context directory, copied into each task workspace, and appended to the agent system message as path-labeled instructions.

config:
  instruction_files:
    - .github/instructions/project.instructions.md

Type: boolean
Default: true

Controls whether Waza injects the full target SKILL.md or .agent.md body into the system prompt when skill: is set.

config:
  inject_skill_body: false

When false, Waza keeps skill discovery enabled and still includes the compact <available_skills> summary with names and descriptions, but it does not add the target <skill_context> block. This is useful for trigger-precision evals that use behavior or skill_invocation graders to measure whether the agent actually invokes the skill tool. disabled_skills: ["*"] takes precedence and disables all skill loading.
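A trigger-precision eval along these lines might look like the following sketch. The suite name, description, and grader name are hypothetical, and the grader's type-specific config is deliberately left out because it is not covered on this page.

```yaml
name: code-explainer-trigger-eval # hypothetical suite name
description: "Checks whether the agent invokes code-explainer on its own"
skill: code-explainer
config:
  inject_skill_body: false # keep <available_skills>, omit <skill_context>
graders:
  - type: skill_invocation
    name: invokes_skill # hypothetical grader name
    # Type-specific config omitted here; see Validators & Graders.
tasks:
  - "tasks/*.yaml"
```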

Type: list[string]
Default: []

Skills that must be available before the evaluation runs. The run will fail if any required skill is not found.

config:
  required_skills:
    - code-analyzer
    - test-runner

Type: list[string]
Default: []

Skills to disable for the evaluation. Use ["*"] to disable all skill loading entirely, or list specific skill directory names to exclude.

# Disable all skills
config:
  disabled_skills: ["*"]

# Disable specific skills
config:
  disabled_skills:
    - noisy-skill
    - experimental-skill

Type: object
Default: (none)

MCP (Model Context Protocol) server configurations for this evaluation. Each key is a server name, and the value is its configuration.

config:
  mcp_servers:
    filesystem:
      command: /path/to/server
      args: [--root, /data]
    github:
      url: http://localhost:3000

List of validation rules. Used across tasks.

graders:
  - type: text
    name: pattern_check
    config:
      regex_match: ["success"]
  - type: code
    name: logic_check
    config:
      assertions:
        - "len(output) > 0"

| Field | Type | Description |
| --- | --- | --- |
| type | string | Grader type: code, prompt, text, file, json_schema, program, behavior, action_sequence, skill_invocation, trigger, diff, tool_constraint, tool_calls |
| name | string | Unique grader name (used to reference in tasks) |
| weight | float | Relative importance in composite scoring (default: 1.0) |
| config | object | Type-specific configuration |

See Validators & Graders for complete documentation.
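To illustrate the weight field, the sketch below gives one grader twice the relative weight of another. The grader names are hypothetical, and how the weights are combined into the composite score is an assumption (proportional weighting); the field itself and its 1.0 default come from the table above.

```yaml
graders:
  - type: text
    name: mentions_function # hypothetical grader name
    weight: 2.0 # assumed to count twice as much as a weight-1.0 grader
    config:
      regex_match: ["(?i)function"]
  - type: code
    name: non_empty # hypothetical grader name
    # weight omitted: defaults to 1.0
    config:
      assertions:
        - "len(output) > 0"
```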

List of test cases to run.

tasks:
  - "tasks/*.yaml" # Load from files
  - "tasks/basic.yaml" # Specific file

tasks:
  - "tasks/*.yaml" # All YAML files
  - "tasks/basic/*.yaml" # Subdirectory
  - "tasks/important.yaml" # Single file

File paths are resolved relative to eval.yaml.

Individual task files (tasks/task-name.yaml).

id: basic-usage-001 # Required: unique ID
name: Basic Usage # Required: display name
description: "..." # Optional: full description
tags: # Optional: for filtering
  - basic
  - happy-path
inputs: # Required: test inputs
  prompt: "Your instruction"
  files:
    - path: sample.py
expected: # Required: validation rules
  output_contains: ["function"]
  behavior:
    max_tool_calls: 5

Type: string
Required: yes

Unique task identifier within the eval suite.

id: basic-usage-001

Type: string
Required: yes

Human-readable task name.

name: "Basic Usage - Python Function"

Type: string
Required: no

Full description of what the task tests.

description: "Test that the agent explains a simple Python function correctly"

Type: array of strings
Required: no

Tags for filtering and categorization.

tags:
  - basic
  - happy-path
  - python

Usage:

waza run eval.yaml --tags "basic"
waza run eval.yaml --tags "edge-case"

Override the eval-level skill_directories for a specific task. Paths are resolved relative to the eval YAML directory.

skill_directories:
  - ./skills/custom-task-skills

When specified on a task, this replaces (not merges with) the eval-level skill_directories.
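Because the task-level list replaces the eval-level one, any eval-level directory a task still needs has to be repeated in the task file. The sketch below shows both files as separate YAML documents; the task file name is hypothetical.

```yaml
# eval.yaml
config:
  skill_directories:
    - ./custom-skills
    - /opt/shared-skills
---
# tasks/custom-task.yaml (hypothetical task file)
# Effective list for this task: only the two entries below.
skill_directories:
  - ./skills/custom-task-skills
  - ./custom-skills # repeated because eval-level entries are not inherited
```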

Apply additional instruction files for a specific task. Paths are resolved relative to the active fixtures/context directory. Task-level entries are added to the eval-level config.instruction_files list.

instruction_files:
  - .github/instructions/review.instructions.md
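Unlike skill_directories, task-level instruction files are added to the eval-level list instead of replacing it. A sketch of how the two levels combine, with a hypothetical task file name:

```yaml
# eval.yaml
config:
  instruction_files:
    - .github/instructions/project.instructions.md
---
# tasks/review.yaml (hypothetical task file)
instruction_files:
  - .github/instructions/review.instructions.md
# Instruction files applied to this task:
#   project.instructions.md (eval level) and review.instructions.md (task level)
```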

Test inputs passed to the agent.

inputs:
  prompt: "Your instruction here"
  files:
    - path: sample.py
    - path: nested/module.py
    - content: |
        def hello():
            print("Hi")

Type: string
Required: yes (unless prompt_file is used)

Instruction text sent to the agent. Supports templating:

inputs:
  prompt: |
    Explain this Python code:
    {{fixture:sample.py}}

Type: string
Required: no (alternative to prompt)

Path to a file containing the prompt text, resolved relative to the task YAML file’s directory. Use this when prompts are long or shared across tasks. You must specify either prompt or prompt_file, but not both.

inputs:
  prompt_file: prompts/review-instructions.md

Type: array of strings
Required: no

Sequential follow-up prompts sent after the initial prompt. Each follow-up reuses the same session and workspace directory, preserving conversation history and file changes across turns.

inputs:
  prompt: "Create a helper function"
  follow_up_prompts:
    - "Add error handling"
    - "Write tests"

Graders evaluate only the final state after all prompts complete. If any follow-up fails, remaining prompts are skipped and the run is marked as an error.

Type: object
Required: no

An LLM-backed surrogate user that answers the skill’s follow-up questions during a multi-turn run. Mutually exclusive with follow_up_prompts.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | no | Responder model. Defaults to the eval-level config.model. |
| instructions | string | yes | Target configuration the responder represents + abstain rule. |
| max_followups | integer | yes | Max responder replies before the loop stops (>= 1). |

inputs:
  prompt: "Add a new agent to my application"
  responder:
    instructions: "Be research-agent with tools web_search; abstain if unknown."
    max_followups: 8

The responder classifies each agent message as reply, stop, or abstain. An abstain marks the run as an error with outcome abstained, distinct from model timeouts or network errors. If max_followups is reached while the agent is still asking questions, the loop stops with outcome cap_exhausted and graders evaluate the final state.
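For a tighter loop with a pinned responder model, a configuration along these lines could be used. This is a sketch that reuses the fields from the table above; the cap of 3 is an arbitrary illustrative value, and nesting responder under inputs follows the example above.

```yaml
inputs:
  prompt: "Add a new agent to my application"
  responder:
    model: claude-sonnet-4.6 # optional; omit to inherit config.model
    instructions: "Be research-agent with tools web_search; abstain if unknown."
    # If the agent is still asking questions after 3 replies, the loop stops
    # with outcome cap_exhausted and graders evaluate the final state.
    max_followups: 3
```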

Type: array
Required: no

Files to include in the task context.

inputs:
  files:
    - path: sample.py # Reference fixture
    - content: "def foo(): ..." # Or inline content

Type: array
Required: no

Local git repositories to materialize into the per-task workspace before the agent runs. Useful when developing skills inside the same repo the skills operate on — each test run gets a clean isolated checkout instead of hand-staged fixtures.

inputs:
  prompt: "Explain the repo layout"
  workdir: my-repo
  repos:
    - type: worktree # required; only "worktree" is currently supported
      source: /path/to/local/clone # required; local git repo to source from
      commit: main # optional; commit SHA, branch, or tag (defaults to HEAD)
      dest: my-repo # required; relative subdir under the workspace

| Field | Required | Description |
| --- | --- | --- |
| type | yes | Materialization strategy. Currently only worktree (uses git worktree add --detach against a local source). |
| source | yes | Local filesystem path to a git repository. |
| commit | no | Commit SHA, branch, or tag. Defaults to source HEAD. Branch/tag names use --detach so they don’t conflict with the source checkout. |
| dest | yes | Relative subdirectory under the workspace where the repo is materialized. Required because git worktree add refuses targets that already exist. Must not contain .. segments. |

Worktrees are removed via git worktree remove --force when the engine shuts down, before the workspace directory is deleted.

Type: string
Required: no

Relative path inside the workspace to use as the agent’s working directory. Typically set to the same value as a repos[*].dest so the agent starts inside the checked-out repo. Must not contain path traversal.

inputs:
  workdir: my-repo

Validation rules and constraints.

expected:
  output_contains: ["function", "parameter"]
  output_not_contains: ["error"]
  output_contains_any: ["recursion", "iteration", "loop"]
  outcomes:
    - type: task_completed
  behavior:
    max_tool_calls: 5
    max_tokens: 4096

Type: array of strings

Strings that must appear in output.

expected:
  output_contains:
    - "function"
    - "parameter"
    - "return"

Type: array of strings

Strings that must NOT appear.

expected:
  output_not_contains:
    - "error"
    - "failed"

Type: array of strings

At least one of these strings must appear in the output (OR logic). Useful when an agent may express a concept in different ways. All checks are case-insensitive.

expected:
  output_contains_any:
    - "recursion"
    - "iteration"
    - "loop"

Type: array of strings

Regex patterns to match.

expected:
  matches:
    - "returns\\s+.*value"
    - "def\\s+\\w+\\("
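The doubled backslashes are a YAML quoting detail, not part of the regex: inside double-quoted scalars the backslash is an escape character, so `\\s` reaches the regex engine as `\s`. Single-quoted scalars take backslashes literally, so the same two patterns can be written with single backslashes:

```yaml
expected:
  matches:
    - 'returns\s+.*value' # same regex as "returns\\s+.*value"
    - 'def\s+\w+\(' # same regex as "def\\s+\\w+\\("
```

Single quotes are usually easier to read for regex-heavy tasks; a literal single quote inside such a scalar is written as two single quotes.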

Type: array of objects

Expected execution outcomes.

expected:
  outcomes:
    - type: task_completed
    - type: tool_called
      tool_name: code_analyzer

Type: object

Behavioral constraints on agent execution. Each constraint that is set contributes to the behavior efficiency score.

| Field | Type | Description |
| --- | --- | --- |
| max_tool_calls | int | Maximum tool invocations allowed |
| max_iterations | int | Maximum conversation rounds (turns) |
| max_tokens | int | Maximum tokens in the response |
| max_response_time_ms | int | Maximum wall-clock execution time in milliseconds |
| required_tools | string[] | Tools the agent must use |
| forbidden_tools | string[] | Tools the agent must NOT use |

expected:
  behavior:
    max_tool_calls: 5
    max_iterations: 10
    max_tokens: 4096
    max_response_time_ms: 30000 # 30 seconds
    required_tools:
      - grep
    forbidden_tools:
      - rm

Optional project-level configuration file (.waza.yaml).

# Root of waza project
paths:
  skills: skills
  evals: evals
  results: results

# Generated/discovered eval and task naming
files:
  evalFile: eval.yaml
  taskGlob: tasks/*.yaml
  taskFileSuffix: .yaml

# Model defaults
defaults:
  model: claude-sonnet-4.6
  timeout: 300
  workers: 0

# Cache settings
cache:
  enabled: false
  dir: .waza-cache

# Token budget configuration
tokens:
  warningThreshold: 2500
  fallbackLimit: 1000
  limits:
    defaults:
      "SKILL.md": 500
      "references/**/*.md": 1000
      "*.md": 2000
    overrides:
      "README.md": 3000

# Cloud storage (optional)
storage:
  provider: azure-blob
  accountName: "myteamwaza"
  containerName: "waza-results"
  enabled: true

Controls the filenames generated by waza new skill and waza new eval. Workspace discovery prefers files.evalFile and falls back to eval.yaml for compatibility.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| evalFile | string | eval.yaml | Eval filename to generate and discover |
| taskGlob | string | tasks/*.yaml | Task glob written into generated eval specs |
| taskFileSuffix | string | .yaml | Suffix for generated task files |

For editor-friendly Waza-specific names:

files:
  evalFile: waza-eval.yaml
  taskGlob: tasks/*.waza-task.yaml
  taskFileSuffix: .waza-task.yaml

Per-file token budget configuration used by waza tokens check and waza tokens suggest. See the Token Limits guide for full documentation including pattern matching rules and migration from .token-limits.json.

Priority order: .waza.yaml tokens.limits → .token-limits.json (legacy, emits deprecation warning) → built-in defaults.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| warningThreshold | integer | 2500 | Token count at which a soft warning is shown |
| fallbackLimit | integer | 1000 | Limit applied to files that match no pattern |
| limits.defaults | map | (built-in) | Glob patterns → token limits |
| limits.overrides | map | {} | Exact file paths → token limits (take precedence over defaults) |

Both limits.defaults and limits.overrides are optional. When tokens.limits is present in .waza.yaml, .token-limits.json is not consulted.
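The annotated sketch below walks through how these rules apply to a small configuration. The comments restate the precedence described above; the exact glob-matching behavior is covered in the Token Limits guide and is not assumed here.

```yaml
tokens:
  fallbackLimit: 1000
  limits: # present, so .token-limits.json is not consulted
    defaults:
      "SKILL.md": 500 # glob pattern → limit
    overrides:
      "README.md": 3000 # exact path; takes precedence over any defaults pattern
# A file that matches no entry in defaults or overrides is held to
# fallbackLimit (1000 tokens here).
```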

Configuration for uploading evaluation results to cloud storage.

Type: string
Values: azure-blob

The cloud provider to use for result storage.

storage:
  provider: azure-blob

Type: string
Required: yes (when storage: is configured)

The Azure Storage account name. Results are uploaded to blob storage in this account.

storage:
  accountName: "myteamwaza"

Type: string
Default: waza-results

The blob container where results are stored.

storage:
  containerName: "waza-results"

Type: boolean
Default: true (when storage: is configured)

Enable or disable automatic result uploads. When false, results are only saved locally.

storage:
  enabled: true

Complete eval.yaml example:

name: code-explainer-eval
description: "Evaluation suite for code-explainer skill"
skill: code-explainer
version: "1.0"
config:
  trials_per_task: 1
  timeout_seconds: 300
  parallel: false
  workers: 0
  model: claude-sonnet-4.6
  executor: copilot-sdk
graders:
  - type: text
    name: explains_concepts
    config:
      regex_match:
        - "function"
        - "parameter"
        - "return"
  - type: code
    name: minimum_length
    config:
      assertions:
        - "len(output) > 200"
  - type: tool_calls
    name: reasonable_calls
    config:
      max_calls: 5
tasks:
  - "tasks/*.yaml"

See schemas/config.schema.json in the repository for complete JSON Schema.

# Validate config
jq . schemas/config.schema.json