YAML Schema
Complete reference for the YAML schema used in waza evaluations.
eval.yaml
Main evaluation configuration file.

name: code-explainer-eval          # Required: eval suite name
description: "..."                 # Required: what this eval tests
skill: code-explainer              # Required: skill name
version: "1.0"                     # Optional: version number

config:
  trials_per_task: 1               # Runs per task
  timeout_seconds: 300             # Task timeout
  parallel: false                  # Concurrent execution
  workers: 4                       # Parallel workers
  model: claude-sonnet-4.6         # Default model
  executor: mock                   # mock or copilot-sdk

graders:                           # Validation rules
  - type: text
    name: checks_logic
    config:
      regex_match: "(?i)(function)"

tasks:                             # Test cases
  - "tasks/*.yaml"                 # From files
  - id: inline-001                 # Or inline
    name: Inline Task
    inputs:
      prompt: "..."

Top-Level Fields

name

Type: string
Required: yes
The unique name of this evaluation suite. Used in reports and result files.

name: code-explainer-eval

description

Type: string
Required: yes
Describes what this evaluation tests. Appears in reports.

description: "Evaluates agent's ability to explain Python code"

skill

Type: string
Required: yes
The skill being evaluated.

skill: code-explainer

version

Type: string
Required: no
Default: (empty)
Version number of the evaluation suite.

version: "1.0"
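
The required and optional top-level fields described above can be checked before a run starts. The following is a minimal sketch rather than waza's own validator: it assumes eval.yaml has already been parsed into a Python dict, and `missing_required_fields` is a hypothetical helper name.

```python
# Hypothetical helper; waza's real validation may differ.
REQUIRED_TOP_LEVEL = ("name", "description", "skill")


def missing_required_fields(spec: dict) -> list[str]:
    """Return required top-level keys that are absent or not non-empty strings."""
    missing = []
    for key in REQUIRED_TOP_LEVEL:
        value = spec.get(key)
        if not isinstance(value, str) or not value.strip():
            missing.append(key)
    return missing


spec = {"name": "code-explainer-eval", "skill": "code-explainer", "version": "1.0"}
print(missing_required_fields(spec))  # ['description']
```

Because version is optional, leaving it out does not produce an entry in the result.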

config Section

trials_per_task

Type: integer
Default: 1
How many times each task is run. Use a value greater than 1 for statistical confidence.

config:
  trials_per_task: 3

timeout_seconds

Type: integer
Default: 300
Maximum number of seconds a task may run before it times out.

config:
  timeout_seconds: 300   # 5 minutes

Common values:
- 60: Quick validation tasks
- 300: Standard code analysis
- 600: Complex multi-file tasks
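
Together, trials_per_task and timeout_seconds bound how long a sequential run can take. The arithmetic is sketched below, assuming the timeout applies to each trial separately; the function name and the task count are illustrative only, and real runs normally finish much sooner.

```python
def worst_case_seconds(num_tasks: int, trials_per_task: int = 1, timeout_seconds: int = 300) -> int:
    """Upper bound on sequential run time: every trial of every task hits the timeout."""
    return num_tasks * trials_per_task * timeout_seconds


# 10 tasks, 3 trials each, default 300-second timeout
print(worst_case_seconds(10, trials_per_task=3))  # 9000 (2.5 hours)
```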
parallel
Type: boolean
Default: false
Run tasks concurrently instead of sequentially.

config:
  parallel: true

workers

Type: integer
Default: 4
Number of concurrent workers when parallel: true.

config:
  parallel: true
  workers: 8

model

Type: string
Default: (empty)
Default LLM model. Override it with the --model flag.

config:
  model: claude-sonnet-4.6

executor

Type: string
Default: mock
Options: mock, copilot-sdk
Execution engine:
- mock: Local testing (no API calls)
- copilot-sdk: Real LLM execution

config:
  executor: copilot-sdk
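
The defaults documented for the config section can be summarized as an overlay: values given in eval.yaml replace the defaults, and anything omitted falls back. This is a sketch under that assumption, not waza's actual loader; `effective_config` is a hypothetical name, and an empty string stands in for the "(empty)" model default.

```python
# Defaults as listed in this reference.
CONFIG_DEFAULTS = {
    "trials_per_task": 1,
    "timeout_seconds": 300,
    "parallel": False,
    "workers": 4,
    "model": "",
    "executor": "mock",
}
ALLOWED_EXECUTORS = {"mock", "copilot-sdk"}


def effective_config(user_config: dict) -> dict:
    """Overlay user-supplied config values on the documented defaults."""
    merged = {**CONFIG_DEFAULTS, **user_config}
    if merged["executor"] not in ALLOWED_EXECUTORS:
        raise ValueError(f"unknown executor: {merged['executor']!r}")
    return merged


cfg = effective_config({"parallel": True, "workers": 8})
print(cfg["workers"], cfg["executor"], cfg["timeout_seconds"])  # 8 mock 300
```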

graders Section

List of validation rules, shared across tasks.

graders:
  - type: text
    name: pattern_check
    config:
      regex_match: ["success"]

  - type: code
    name: logic_check
    config:
      assertions:
        - "len(output) > 0"

Grader Fields

| Field | Type | Description |
|---|---|---|
| type | string | Grader type: code, regex, keyword, file, diff, json_schema, prompt, behavior, action_sequence, skill_invocation, program |
| name | string | Unique grader name (used to reference the grader in tasks) |
| weight | float | Relative importance in composite scoring (default: 1.0) |
| config | object | Type-specific configuration |
See Validators & Graders for complete documentation.
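
The weight field implies that grader results are combined into a single composite score. One plausible reading is a weighted mean, sketched below. The formula, the 0-to-1 score range, and the `composite_score` helper are all assumptions for illustration, not waza's documented algorithm.

```python
def composite_score(results: list[dict]) -> float:
    """Weighted mean of grader scores; each result has 'score' and an optional 'weight'."""
    total_weight = sum(r.get("weight", 1.0) for r in results)
    if total_weight == 0:
        return 0.0
    weighted = sum(r["score"] * r.get("weight", 1.0) for r in results)
    return weighted / total_weight


results = [
    {"name": "pattern_check", "score": 1.0, "weight": 3.0},
    {"name": "logic_check", "score": 0.0},  # weight falls back to the default of 1.0
]
print(composite_score(results))  # 0.75
```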
tasks Section
List of test cases to run.

tasks:
  - "tasks/*.yaml"        # Load from files
  - "tasks/basic.yaml"    # Specific file
  - id: inline-001        # Inline task
    name: "Test 1"
    inputs:
      prompt: "..."

Loading from Files

tasks:
  - "tasks/*.yaml"          # All YAML files
  - "tasks/basic/*.yaml"    # Subdirectory
  - "tasks/important.yaml"  # Single file

File paths are resolved relative to the location of eval.yaml.
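
Resolving task patterns against the location of eval.yaml can be pictured with the standard library's `pathlib`. This is a sketch of the idea only: `resolve_task_files` is a hypothetical helper, and waza's own ordering and de-duplication rules are not stated in this reference.

```python
import tempfile
from pathlib import Path


def resolve_task_files(eval_path: Path, patterns: list[str]) -> list[Path]:
    """Expand task patterns relative to the directory that holds eval.yaml."""
    base = eval_path.parent
    found: list[Path] = []
    for pattern in patterns:
        for match in sorted(base.glob(pattern)):
            if match not in found:  # a file may match several patterns
                found.append(match)
    return found


# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tasks").mkdir()
    (root / "tasks" / "basic.yaml").write_text("id: basic-001\n")
    (root / "tasks" / "edge.yaml").write_text("id: edge-001\n")
    files = resolve_task_files(root / "eval.yaml", ["tasks/*.yaml", "tasks/basic.yaml"])
    print([f.name for f in files])  # ['basic.yaml', 'edge.yaml']
```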
Inline Tasks
Define tasks directly in eval.yaml:

tasks:
  - id: task-001
    name: "Basic Usage"
    inputs:
      prompt: "Explain this"

Task File Format

Individual task files (tasks/task-name.yaml).

id: basic-usage-001            # Required: unique ID
name: Basic Usage              # Required: display name
description: "..."             # Optional: full description

tags:                          # Optional: for filtering
  - basic
  - happy-path

inputs:                        # Required: test inputs
  prompt: "Your instruction"
  files:
    - path: sample.py

expected:                      # Required: validation rules
  output_contains: ["function"]
  behavior:
    max_tool_calls: 5

id

Type: string
Required: yes
Unique task identifier within the eval suite.

id: basic-usage-001

name

Type: string
Required: yes
Human-readable task name.

name: "Basic Usage - Python Function"

description

Type: string
Required: no
Full description of what the task tests.

description: "Test that the agent explains a simple Python function correctly"

tags

Type: array of strings
Required: no
Tags for filtering and categorization.

tags:
  - basic
  - happy-path
  - python

Usage:

waza run eval.yaml --tags "basic"
waza run eval.yaml --tags "edge-case"

inputs Section

Test inputs passed to the agent.

inputs:
  prompt: "Your instruction here"
  files:
    - path: sample.py
    - path: nested/module.py
    - content: |
        def hello():
            print("Hi")

prompt

Type: string
Required: yes
Instruction text sent to the agent. Supports templating:

inputs:
  prompt: |
    Explain this Python code:
    {{fixture:sample.py}}

files

Type: array
Required: no
Files to include in the task context.

inputs:
  files:
    - path: sample.py            # Reference fixture
    - content: "def foo(): ..."  # Or inline content
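
The {{fixture:...}} placeholder shown for prompt can be pictured as a text substitution. The sketch below assumes the placeholder is replaced with the named fixture's contents; the exact lookup rules are not stated in this reference, and both `render_prompt` and the in-memory `fixtures` mapping are hypothetical.

```python
import re

FIXTURE_PATTERN = re.compile(r"\{\{fixture:([^}]+)\}\}")


def render_prompt(template: str, fixtures: dict[str, str]) -> str:
    """Replace each {{fixture:NAME}} with the text stored under NAME."""

    def substitute(match: re.Match) -> str:
        name = match.group(1).strip()
        if name not in fixtures:
            raise KeyError(f"unknown fixture: {name}")
        return fixtures[name]

    # A function replacement avoids backslash escapes in fixture text being reinterpreted.
    return FIXTURE_PATTERN.sub(substitute, template)


fixtures = {"sample.py": "def add(a, b):\n    return a + b\n"}
prompt = render_prompt("Explain this Python code:\n{{fixture:sample.py}}", fixtures)
print(prompt)
```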

expected Section

Validation rules and constraints.

expected:
  output_contains: ["function", "parameter"]
  output_excludes: ["error"]
  matches: ["def\\s+\\w+"]
  outcomes:
    - type: task_completed
  behavior:
    max_tool_calls: 5
    max_response_time_ms: 30000
    max_tokens: 4096

output_contains

Type: array of strings
Strings that must appear in the output.

expected:
  output_contains:
    - "function"
    - "parameter"
    - "return"

output_excludes

Type: array of strings
Strings that must NOT appear in the output.

expected:
  output_excludes:
    - "error"
    - "failed"

matches

Type: array of strings
Regex patterns that the output must match.

expected:
  matches:
    - "returns\\s+.*value"
    - "def\\s+\\w+\\("
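
The three text checks (output_contains, output_excludes, matches) can be sketched as plain substring and regex tests. The semantics here are assumptions: case-sensitive substring matching and a regex search anywhere in the output. `check_output` is a hypothetical helper, not part of waza.

```python
import re


def check_output(output: str, expected: dict) -> list[str]:
    """Return human-readable failures; an empty list means every check passed."""
    failures = []
    for needle in expected.get("output_contains", []):
        if needle not in output:
            failures.append(f"missing required text: {needle!r}")
    for needle in expected.get("output_excludes", []):
        if needle in output:
            failures.append(f"found forbidden text: {needle!r}")
    for pattern in expected.get("matches", []):
        if re.search(pattern, output) is None:
            failures.append(f"no match for pattern: {pattern!r}")
    return failures


expected = {
    "output_contains": ["function", "return"],
    "output_excludes": ["error"],
    "matches": [r"def\s+\w+\("],
}
output = "This function is declared as def add(a, b) and will return the sum."
print(check_output(output, expected))  # []
```

In YAML double-quoted strings the backslashes are doubled (`"def\\s+\\w+\\("`), whereas the Python raw string above writes the same pattern with single backslashes.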

outcomes

Type: array of objects
Expected execution outcomes.

expected:
  outcomes:
    - type: task_completed
    - type: tool_called
      tool_name: code_analyzer

behavior

Type: object
Behavioral constraints.

expected:
  behavior:
    max_tool_calls: 5            # Max tools to invoke
    max_response_time_ms: 30000  # Max execution time
    max_tokens: 4096             # Max tokens in response
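
Each behavior entry reads as an upper limit on something measured during the run. A minimal sketch of that comparison follows. The metric names (`tool_calls`, `response_time_ms`, `tokens`) and the `behavior_violations` helper are hypothetical, since this reference does not say how waza records these measurements.

```python
def behavior_violations(metrics: dict, behavior: dict) -> list[str]:
    """Compare observed metrics against the limits in expected.behavior."""
    # Maps each documented limit to a hypothetical metric name.
    limit_to_metric = {
        "max_tool_calls": "tool_calls",
        "max_response_time_ms": "response_time_ms",
        "max_tokens": "tokens",
    }
    violations = []
    for limit_name, metric_name in limit_to_metric.items():
        if limit_name in behavior and metrics.get(metric_name, 0) > behavior[limit_name]:
            violations.append(
                f"{metric_name}={metrics[metric_name]} exceeds {limit_name}={behavior[limit_name]}"
            )
    return violations


behavior = {"max_tool_calls": 5, "max_response_time_ms": 30000, "max_tokens": 4096}
metrics = {"tool_calls": 7, "response_time_ms": 1200, "tokens": 900}
print(behavior_violations(metrics, behavior))  # ['tool_calls=7 exceeds max_tool_calls=5']
```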

.waza.yaml Configuration

Optional project-level configuration file.

# Root of waza project
paths:
  skills: skills
  evals: evals
  results: results

# Model defaults
defaults:
  model: claude-sonnet-4.6
  timeout: 300
  workers: 4

# Cache settings
cache:
  enabled: false
  dir: .waza-cache

# Cloud storage (optional)
storage:
  provider: azure-blob
  accountName: "myteamwaza"
  containerName: "waza-results"
  enabled: true

storage Section

Configuration for uploading evaluation results to cloud storage.
provider
Type: string
Values: azure-blob
The cloud provider to use for result storage.

storage:
  provider: azure-blob

accountName

Type: string
Required: yes (when storage: is configured)
The Azure Storage account name. Results are uploaded to blob storage in this account.

storage:
  accountName: "myteamwaza"

containerName

Type: string
Default: waza-results
The blob container where results are stored.

storage:
  containerName: "waza-results"

enabled

Type: boolean
Default: true (when storage: is configured)
Enable or disable automatic result uploads. When false, results are only saved locally.

storage:
  enabled: true
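
The storage rules above (accountName required, containerName defaulting to waza-results, enabled defaulting to true once a storage block exists) can be sketched as follows. `effective_storage` is a hypothetical helper operating on an already-parsed .waza.yaml dict; it is not waza's implementation.

```python
from typing import Optional


def effective_storage(project_config: dict) -> Optional[dict]:
    """Apply the documented storage defaults; return None when storage is not configured."""
    storage = project_config.get("storage")
    if storage is None:
        return None
    if not storage.get("accountName"):
        raise ValueError("storage.accountName is required when storage is configured")
    return {
        "provider": storage.get("provider"),  # no default is documented for provider
        "accountName": storage["accountName"],
        "containerName": storage.get("containerName", "waza-results"),
        "enabled": storage.get("enabled", True),
    }


resolved = effective_storage({"storage": {"provider": "azure-blob", "accountName": "myteamwaza"}})
print(resolved["containerName"], resolved["enabled"])  # waza-results True
```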

Example Complete eval.yaml

name: code-explainer-eval
description: "Evaluation suite for code-explainer skill"
skill: code-explainer
version: "1.0"

config:
  trials_per_task: 1
  timeout_seconds: 300
  parallel: false
  workers: 4
  model: claude-sonnet-4.6
  executor: copilot-sdk

graders:
  - type: text
    name: explains_concepts
    config:
      regex_match:
        - "function"
        - "parameter"
        - "return"

  - type: code
    name: minimum_length
    config:
      assertions:
        - "len(output) > 200"

  - type: tool_calls
    name: reasonable_calls
    config:
      max_calls: 5

tasks:
  - "tasks/*.yaml"

JSON Schema (programmatic access)

See schemas/config.schema.json in the repository for the complete JSON Schema.

# Pretty-print the schema (also confirms the file is well-formed JSON)
jq . schemas/config.schema.json

Next Steps

- Writing Eval Specs: Full guide with examples
- CLI Reference: All commands
- GitHub Repository: Source code