YAML Schema

Complete reference for the YAML schema used in waza evaluations.

eval.yaml

Main evaluation configuration file.

name: code-explainer-eval          # Required: eval suite name
description: "..."                 # Required: what this eval tests
skill: code-explainer              # Required: skill name
version: "1.0"                      # Optional: version number

config:
  trials_per_task: 1               # Runs per task
  timeout_seconds: 300             # Task timeout
  parallel: false                  # Concurrent execution
  workers: 4                        # Parallel workers
  model: claude-sonnet-4.6          # Default model
  executor: mock                    # mock or copilot-sdk

graders:                            # Validation rules
  - type: text
    name: checks_logic
    config:
      regex_match: ["(?i)(function)"]

tasks:                              # Test cases
  - "tasks/*.yaml"                 # From files
  - id: inline-001                 # Or inline
    name: Inline Task
    inputs:
      prompt: "..."

Top-Level Fields

name

Type: string
Required: yes

The unique name of this evaluation suite. Used in reports and result files.

name: code-explainer-eval

description

Type: string
Required: yes

Describes what this evaluation tests. Appears in reports.

description: "Evaluates agent's ability to explain Python code"

skill

Type: string
Required: yes

The skill being evaluated.

skill: code-explainer

version

Type: string
Required: no
Default: (empty)

Version number of the evaluation suite.

version: "1.0"
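As a quick illustration of the required/optional split above, the following sketch checks an already-parsed eval spec for the three required top-level fields. It is not waza's validator: `missing_required` is a hypothetical helper, and treating an empty string as missing is an assumption.

```python
# Hypothetical helper, not part of waza: reports required top-level
# fields that are absent (or empty) in an already-parsed eval.yaml mapping.
REQUIRED_TOP_LEVEL = ("name", "description", "skill")  # "Required: yes" above


def missing_required(spec: dict) -> list[str]:
    """Return the required top-level keys that are absent or empty."""
    return [key for key in REQUIRED_TOP_LEVEL if not spec.get(key)]


spec = {"name": "code-explainer-eval", "skill": "code-explainer", "version": "1.0"}
print(missing_required(spec))  # ['description']
```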

config Section

trials_per_task

Type: integer
Default: 1

Number of times each task is run. Use a value greater than 1 when you need statistical confidence in the results.

config:
  trials_per_task: 3
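To see why more than one trial helps, consider how repeated trials could be summarized. The sketch below computes a simple pass rate; `pass_rate` is a hypothetical helper, and how waza actually aggregates trials is not specified here.

```python
# Illustrative only: one way to summarize repeated trials of a single task.
def pass_rate(trial_results: list[bool]) -> float:
    """Fraction of trials that passed."""
    if not trial_results:
        raise ValueError("at least one trial is required")
    return sum(trial_results) / len(trial_results)


# With trials_per_task: 1 the rate can only be 0.0 or 1.0;
# three trials already distinguish a flaky task from a stable one.
print(round(pass_rate([True, False, True]), 2))  # 0.67
```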

timeout_seconds

Type: integer
Default: 300

Maximum number of seconds a task may run before it times out.

config:
  timeout_seconds: 300  # 5 minutes

Common values:

60 — Quick validation tasks
300 — Standard code analysis
600 — Complex multi-file tasks

parallel

Type: boolean
Default: false

Run tasks concurrently instead of sequentially.

config:
  parallel: true

workers

Type: integer
Default: 4

Number of concurrent workers when parallel: true.

config:
  parallel: true
  workers: 8
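The relationship between parallel and workers can be pictured with a small thread-pool sketch. This is an assumption about the general shape of such a runner, not waza's scheduler; `run_tasks` and `run_one` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_tasks(task_ids: list[str], run_one: Callable[[str], str],
              parallel: bool = False, workers: int = 4) -> list[str]:
    """Run every task, sequentially or on a pool of `workers` threads."""
    if not parallel:
        # parallel: false -> `workers` is ignored
        return [run_one(task_id) for task_id in task_ids]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order
        return list(pool.map(run_one, task_ids))


print(run_tasks(["a", "b", "c"], str.upper, parallel=True, workers=8))  # ['A', 'B', 'C']
```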

model

Type: string
Default: (empty)

Default LLM model for all tasks. Override it with the --model flag.

config:
  model: claude-sonnet-4.6

executor

Type: string
Default: mock
Options: mock, copilot-sdk

Execution engine:

mock — Local testing (no API calls)
copilot-sdk — Real LLM execution

config:
  executor: copilot-sdk
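The defaults listed in this section can be summarized as a merge of user-supplied values over a defaults table. The sketch below is illustrative, with a hypothetical `resolve_config` helper; it assumes an unknown executor value is rejected, which the reference does not state.

```python
# Defaults as documented in the config section above.
CONFIG_DEFAULTS = {
    "trials_per_task": 1,
    "timeout_seconds": 300,
    "parallel": False,
    "workers": 4,
    "model": "",
    "executor": "mock",
}
EXECUTORS = {"mock", "copilot-sdk"}


def resolve_config(user_config: dict) -> dict:
    """Overlay user values on the documented defaults."""
    resolved = {**CONFIG_DEFAULTS, **user_config}
    if resolved["executor"] not in EXECUTORS:  # assumption: invalid values are errors
        raise ValueError(f"unknown executor: {resolved['executor']!r}")
    return resolved


print(resolve_config({"parallel": True, "workers": 8})["timeout_seconds"])  # 300
```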

graders Section

List of validation rules. Graders are defined once here and shared across tasks.

graders:
  - type: text
    name: pattern_check
    config:
      regex_match: ["success"]

  - type: code
    name: logic_check
    config:
      assertions:
        - "len(output) > 0"

Grader Fields

| Field | Type | Description |
|-------|------|-------------|
| `type` | string | Grader type: `code`, `regex`, `keyword`, `file`, `diff`, `json_schema`, `prompt`, `behavior`, `action_sequence`, `skill_invocation`, `program` |
| `name` | string | Unique grader name (used to reference in tasks) |
| `weight` | float | Relative importance in composite scoring (default: `1.0`) |
| `config` | object | Type-specific configuration |

See Validators & Graders for complete documentation.
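The weight field is described only as "relative importance in composite scoring". One common reading is a weighted average of per-grader scores, sketched below. The formula and the `composite_score` helper are assumptions, not the documented scoring rule.

```python
# Assumed interpretation: composite = sum(score * weight) / sum(weight).
def composite_score(results: list[tuple[float, float]]) -> float:
    """results holds (score, weight) pairs, with each score in [0, 1]."""
    total_weight = sum(weight for _, weight in results)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return sum(score * weight for score, weight in results) / total_weight


# A passing grader with the default weight 1.0 and a failing grader with weight 3.0
print(composite_score([(1.0, 1.0), (0.0, 3.0)]))  # 0.25
```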

tasks Section

List of test cases to run.

tasks:
  - "tasks/*.yaml"          # Load from files
  - "tasks/basic.yaml"      # Specific file
  - id: inline-001          # Inline task
    name: "Test 1"
    inputs:
      prompt: "..."

Loading from Files

tasks:
  - "tasks/*.yaml"              # All YAML files
  - "tasks/basic/*.yaml"        # Subdirectory
  - "tasks/important.yaml"      # Single file

File paths and glob patterns are resolved relative to the location of eval.yaml.
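Resolution relative to eval.yaml can be sketched with pathlib. This is an illustration of the rule, not waza's loader; `resolve_task_files` is a hypothetical helper, and the sorting and de-duplication shown are assumptions.

```python
from pathlib import Path


def resolve_task_files(eval_file: Path, patterns: list[str]) -> list[Path]:
    """Expand file paths and glob patterns against the directory of eval.yaml."""
    base = eval_file.parent
    found: list[Path] = []
    for pattern in patterns:
        for path in sorted(base.glob(pattern)):
            if path not in found:  # a file matched by two entries is listed once
                found.append(path)
    return found
```

For example, `resolve_task_files(Path("evals/code-explainer/eval.yaml"), ["tasks/*.yaml"])` would look in `evals/code-explainer/tasks/`, regardless of the current working directory (the directory name here is only an example).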

Inline Tasks

Define tasks directly in eval.yaml:

tasks:
  - id: task-001
    name: "Basic Usage"
    inputs:
      prompt: "Explain this"

Task File Format

Individual task files (tasks/task-name.yaml).

id: basic-usage-001              # Required: unique ID
name: Basic Usage                # Required: display name
description: "..."               # Optional: full description

tags:                            # Optional: for filtering
  - basic
  - happy-path

inputs:                          # Required: test inputs
  prompt: "Your instruction"
  files:
    - path: sample.py

expected:                        # Required: validation rules
  output_contains: ["function"]
  behavior:
    max_tool_calls: 5

id

Type: string
Required: yes

Unique task identifier within the eval suite.

id: basic-usage-001

name

Type: string
Required: yes

Human-readable task name.

name: "Basic Usage - Python Function"

description

Type: string
Required: no

Full description of what the task tests.

description: "Test that the agent explains a simple Python function correctly"

inputs Section

Test inputs passed to the agent.

inputs:
  prompt: "Your instruction here"
  files:
    - path: sample.py
    - path: nested/module.py
    - content: |
        def hello():
            print("Hi")

prompt

Type: string
Required: yes

Instruction text sent to the agent. Supports templating:

inputs:
  prompt: |
    Explain this Python code:
    {{fixture:sample.py}}
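The placeholder syntax above suggests that the fixture's contents are substituted into the prompt. The sketch below shows that idea under this assumption; `render_prompt` and the `fixtures_dir` parameter are hypothetical, and waza's actual template rules may differ.

```python
import re
from pathlib import Path

# Matches placeholders of the form {{fixture:relative/path}}
FIXTURE_PATTERN = re.compile(r"\{\{fixture:([^}]+)\}\}")


def render_prompt(prompt: str, fixtures_dir: Path) -> str:
    """Replace each {{fixture:NAME}} with the text of fixtures_dir/NAME."""
    def substitute(match: re.Match) -> str:
        return (fixtures_dir / match.group(1).strip()).read_text()

    return FIXTURE_PATTERN.sub(substitute, prompt)
```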

files

Type: array
Required: no

Files to include in the task context.

inputs:
  files:
    - path: sample.py           # Reference fixture
    - content: "def foo(): ..."  # Or inline content

expected Section

Validation rules and constraints.

expected:
  output_contains: ["function", "parameter"]
  output_excludes: ["error"]
  matches: ["def\\s+\\w+"]
  outcomes:
    - type: task_completed
  behavior:
    max_tool_calls: 5
    max_response_time_ms: 30000
    max_tokens: 4096

output_contains

Type: array of strings

Strings that must appear in output.

expected:
  output_contains:
    - "function"
    - "parameter"
    - "return"

output_excludes

Type: array of strings

Strings that must NOT appear.

expected:
  output_excludes:
    - "error"
    - "failed"

matches

Type: array of strings

Regular expressions that the output must match.

expected:
  matches:
    - "returns\\s+.*value"
    - "def\\s+\\w+\\("
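Taken together, output_contains, output_excludes, and matches describe three text checks. The sketch below shows one plausible evaluation of them; `check_output` is hypothetical, and case-sensitive substring tests and `re.search` semantics are assumptions. Note that in a double-quoted YAML string, `\\s` is read as the regex `\s`.

```python
import re


def check_output(output: str, expected: dict) -> list[str]:
    """Return a description of every failed text check (empty list = pass)."""
    failures = []
    for needle in expected.get("output_contains", []):
        if needle not in output:
            failures.append(f"missing: {needle!r}")
    for needle in expected.get("output_excludes", []):
        if needle in output:
            failures.append(f"forbidden: {needle!r}")
    for pattern in expected.get("matches", []):
        if re.search(pattern, output) is None:
            failures.append(f"no match: {pattern!r}")
    return failures


expected = {
    "output_contains": ["function", "return"],
    "output_excludes": ["error"],
    "matches": [r"def\s+\w+\("],
}
print(check_output("The function def add(a, b) adds two numbers.", expected))
# ["missing: 'return'"]
```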

outcomes

Type: array of objects

Expected execution outcomes.

expected:
  outcomes:
    - type: task_completed
    - type: tool_called
      tool_name: code_analyzer

behavior

Type: object

Behavioral constraints.

expected:
  behavior:
    max_tool_calls: 5           # Max tools to invoke
    max_response_time_ms: 30000  # Max execution time
    max_tokens: 4096            # Max tokens in response
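The behavior limits can be read as upper bounds on metrics measured during a run. A minimal sketch follows, assuming hypothetical metric names (`tool_calls`, `response_time_ms`, `tokens`) and a hypothetical `check_behavior` helper; it is not waza's grader.

```python
# Maps each documented limit to an assumed name for the measured value.
LIMIT_TO_METRIC = {
    "max_tool_calls": "tool_calls",
    "max_response_time_ms": "response_time_ms",
    "max_tokens": "tokens",
}


def check_behavior(metrics: dict, limits: dict) -> list[str]:
    """Return the names of limits whose measured value exceeds the bound."""
    return [
        limit_key
        for limit_key, metric_key in LIMIT_TO_METRIC.items()
        if limit_key in limits and metrics.get(metric_key, 0) > limits[limit_key]
    ]


limits = {"max_tool_calls": 5, "max_response_time_ms": 30000, "max_tokens": 4096}
metrics = {"tool_calls": 7, "response_time_ms": 12000, "tokens": 4096}
print(check_behavior(metrics, limits))  # ['max_tool_calls']
```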

.waza.yaml Configuration

Optional project-level configuration file.

# Root of waza project
paths:
  skills: skills
  evals: evals
  results: results

# Model defaults
defaults:
  model: claude-sonnet-4.6
  timeout: 300
  workers: 4

# Cache settings
cache:
  enabled: false
  dir: .waza-cache

# Token budget configuration
tokens:
  warningThreshold: 2500
  fallbackLimit: 1000
  limits:
    defaults:
      "SKILL.md": 500
      "references/**/*.md": 1000
      "*.md": 2000
    overrides:
      "README.md": 3000

# Cloud storage (optional)
storage:
  provider: azure-blob
  accountName: "myteamwaza"
  containerName: "waza-results"
  enabled: true

tokens Section

Per-file token budget configuration used by waza tokens check and waza tokens suggest. See the Token Limits guide for full documentation including pattern matching rules and migration from .token-limits.json.

Priority order: .waza.yaml tokens.limits → .token-limits.json (legacy, emits deprecation warning) → built-in defaults.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `warningThreshold` | integer | `2500` | Token count at which a soft warning is shown |
| `fallbackLimit` | integer | `1000` | Limit applied to files that match no pattern |
| `limits.defaults` | map | (built-in) | Glob patterns → token limits |
| `limits.overrides` | map | `{}` | Exact file paths → token limits (take precedence over `defaults`) |

Both limits.defaults and limits.overrides are optional. When tokens.limits is present in .waza.yaml, .token-limits.json is not consulted.
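The lookup order described by the table (exact overrides, then glob defaults, then fallbackLimit) can be sketched as follows. `resolve_limit` is a hypothetical helper; using `fnmatchcase` and taking the first matching pattern in declared order are assumptions, and fnmatch only approximates real glob rules, so see the Token Limits guide for the actual pattern matching.

```python
from fnmatch import fnmatchcase


def resolve_limit(path: str, tokens_config: dict) -> int:
    """Pick the token limit for `path`: overrides, then defaults, then fallback."""
    limits = tokens_config.get("limits", {})
    overrides = limits.get("overrides", {})
    if path in overrides:  # exact file paths take precedence
        return overrides[path]
    for pattern, limit in limits.get("defaults", {}).items():
        if fnmatchcase(path, pattern):  # assumption: first match in declared order wins
            return limit
    return tokens_config.get("fallbackLimit", 1000)


tokens = {
    "fallbackLimit": 1000,
    "limits": {
        "defaults": {"SKILL.md": 500, "references/**/*.md": 1000, "*.md": 2000},
        "overrides": {"README.md": 3000},
    },
}
print(resolve_limit("README.md", tokens))    # 3000 (override)
print(resolve_limit("SKILL.md", tokens))     # 500  (defaults)
print(resolve_limit("src/main.go", tokens))  # 1000 (fallbackLimit)
```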

storage Section

Configuration for uploading evaluation results to cloud storage.

provider

Type: string
Values: azure-blob

The cloud provider to use for result storage.

storage:
  provider: azure-blob

accountName

Type: string
Required: yes (when storage: is configured)

The Azure Storage account name. Results are uploaded to blob storage in this account.

storage:
  accountName: "myteamwaza"

containerName

Type: string
Default: waza-results

The blob container where results are stored.

storage:
  containerName: "waza-results"

enabled

Type: boolean
Default: true (when storage: is configured)

Enable or disable automatic result uploads. When false, results are only saved locally.

storage:
  enabled: true

Complete eval.yaml Example

name: code-explainer-eval
description: "Evaluation suite for code-explainer skill"
skill: code-explainer
version: "1.0"

config:
  trials_per_task: 1
  timeout_seconds: 300
  parallel: false
  workers: 4
  model: claude-sonnet-4.6
  executor: copilot-sdk

graders:
  - type: text
    name: explains_concepts
    config:
      regex_match:
        - "function"
        - "parameter"
        - "return"

  - type: code
    name: minimum_length
    config:
      assertions:
        - "len(output) > 200"

  - type: tool_calls
    name: reasonable_calls
    config:
      max_calls: 5

tasks:
  - "tasks/*.yaml"

JSON Schema (programmatic access)

See schemas/config.schema.json in the repository for the complete JSON Schema.

# Pretty-print the schema (jq exits with an error if the file is not valid JSON)
jq . schemas/config.schema.json

Next Steps

Writing Eval Specs — Full guide with examples
CLI Reference — All commands
GitHub Repository — Source code