# YAML Schema

Complete reference for the YAML schema used in waza evaluations.
## eval.yaml

Main evaluation configuration file.

```yaml
name: code-explainer-eval   # Required: eval suite name
description: "..."          # Required: what this eval tests
skill: code-explainer       # Required: skill name
version: "1.0"              # Optional: version number

config:
  trials_per_task: 1        # Runs per task
  timeout_seconds: 300      # Task timeout
  parallel: false           # Concurrent execution
  workers: 4                # Parallel workers
  model: claude-sonnet-4.6  # Default model
  executor: mock            # mock or copilot-sdk

graders:                    # Validation rules
  - type: text
    name: checks_logic
    config:
      regex_match:
        - "(?i)(function)"

tasks:                      # Test cases
  - "tasks/*.yaml"          # From files
```

## Top-Level Fields
### name

Type: string
Required: yes

The unique name of this evaluation suite. Used in reports and result files.

```yaml
name: code-explainer-eval
```

### description
Type: string
Required: yes

Describes what this evaluation tests. Appears in reports.

```yaml
description: "Evaluates agent's ability to explain Python code"
```

### skill

Type: string
Required: yes

The skill being evaluated.

```yaml
skill: code-explainer
```

### version
Type: string
Required: no
Default: (empty)

Version number of the evaluation suite.

```yaml
version: "1.0"
```

## config Section

### trials_per_task
Type: integer
Default: 1

How many times each task is run. Use a value greater than 1 for statistical confidence.

```yaml
config:
  trials_per_task: 3
```
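With multiple trials, per-task results are typically summarized as a pass rate. A minimal sketch of that aggregation, assuming each trial reduces to a pass/fail boolean (the `(task_id, passed)` tuple shape is illustrative, not waza's actual result format):

```python
from collections import defaultdict

def pass_rates(trial_results):
    """Aggregate per-trial pass/fail booleans into a per-task pass rate.

    trial_results: list of (task_id, passed) tuples, one per trial.
    """
    counts = defaultdict(lambda: [0, 0])  # task_id -> [passes, trials]
    for task_id, passed in trial_results:
        counts[task_id][0] += int(passed)
        counts[task_id][1] += 1
    return {task: p / n for task, (p, n) in counts.items()}

# Three trials of one task: 2 passes, 1 failure
rates = pass_rates([("t1", True), ("t1", True), ("t1", False)])
```

Running more trials narrows the confidence interval around each task's rate, which is why `trials_per_task > 1` is recommended for non-deterministic executors.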
### timeout_seconds

Type: integer
Default: 300

Maximum seconds per task before timeout.

```yaml
config:
  timeout_seconds: 300   # 5 minutes
```

Common values:

- 60 — Quick validation tasks
- 300 — Standard code analysis
- 600 — Complex multi-file tasks

### parallel

Type: boolean
Default: false

Run tasks concurrently instead of sequentially.

```yaml
config:
  parallel: true
```

### workers
Type: integer
Default: 4

Number of concurrent workers when parallel: true.

```yaml
config:
  parallel: true
  workers: 8
```

### model

Type: string
Default: (empty)

Default LLM model. Override with the --model flag.

```yaml
config:
  model: claude-sonnet-4.6
```

### executor
Type: string
Default: mock
Options: mock, copilot-sdk

Execution engine:

- mock — Local testing (no API calls)
- copilot-sdk — Real LLM execution

```yaml
config:
  executor: copilot-sdk
```

### max_attempts
Type: integer
Default: 0

Maximum retry attempts per task on failure. Set to 0 for no retries.

```yaml
config:
  max_attempts: 3
```

### group_by
Type: string
Default: (none)

Group results by a field in the output (e.g., tags, task_id). Useful for organizing results when running many tasks.

```yaml
config:
  group_by: tags
```

### fail_fast
Type: boolean
Default: false

Stop the entire evaluation run on the first task failure instead of continuing.

```yaml
config:
  fail_fast: true
```

### skill_directories
Type: list[string]
Default: []

Additional directories to search for skills beyond the default skills/ directory.

```yaml
config:
  skill_directories:
    - ./custom-skills
    - /opt/shared-skills
```

### required_skills
Type: list[string]
Default: []

Skills that must be available before the evaluation runs. The run will fail if any required skill is not found.

```yaml
config:
  required_skills:
    - code-analyzer
    - test-runner
```

### disabled_skills
Type: list[string]
Default: []

Skills to disable for the evaluation. Use ["*"] to disable all skill loading entirely, or list specific skill directory names to exclude.

```yaml
# Disable all skills
config:
  disabled_skills: ["*"]

# Disable specific skills
config:
  disabled_skills:
    - noisy-skill
    - experimental-skill
```

### mcp_servers
Type: object
Default: (none)

MCP (Model Context Protocol) server configurations for this evaluation. Each key is a server name, and the value is its configuration.

```yaml
config:
  mcp_servers:
    filesystem:
      command: /path/to/server
      args: [--root, /data]
    github:
      url: http://localhost:3000
```

## graders Section
List of validation rules, shared across tasks.

```yaml
graders:
  - type: text
    name: pattern_check
    config:
      regex_match: ["success"]

  - type: code
    name: logic_check
    config:
      assertions:
        - "len(output) > 0"
```

### Grader Fields
| Field | Type | Description |
|---|---|---|
| type | string | Grader type: code, prompt, text, file, json_schema, program, behavior, action_sequence, skill_invocation, trigger, diff, tool_constraint, tool_calls |
| name | string | Unique grader name (used to reference in tasks) |
| weight | float | Relative importance in composite scoring (default: 1.0) |
| config | object | Type-specific configuration |
See Validators & Graders for complete documentation.
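The weight field suggests that composite scoring is a weighted average of individual grader scores. A sketch of that combination under that assumption (the grader-result dicts below are illustrative, not waza's internal representation):

```python
def composite_score(grader_results):
    """Combine per-grader scores (0.0 to 1.0) into one weighted score.

    grader_results: list of dicts with "score" and optional "weight"
    (defaulting to 1.0, matching the table above).
    """
    total_weight = sum(g.get("weight", 1.0) for g in grader_results)
    if total_weight == 0:
        return 0.0
    weighted = sum(g["score"] * g.get("weight", 1.0) for g in grader_results)
    return weighted / total_weight

score = composite_score([
    {"score": 1.0, "weight": 2.0},  # pattern_check passed, weighted higher
    {"score": 0.5},                 # logic_check half-passed, default weight 1.0
])
# (1.0*2.0 + 0.5*1.0) / 3.0 ≈ 0.833
```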
## tasks Section

List of test cases to run.

```yaml
tasks:
  - "tasks/*.yaml"        # Load from files
  - "tasks/basic.yaml"    # Specific file
```

### Loading from Files

```yaml
tasks:
  - "tasks/*.yaml"          # All YAML files
  - "tasks/basic/*.yaml"    # Subdirectory
  - "tasks/important.yaml"  # Single file
```

File paths are relative to eval.yaml.
## Task File Format

Individual task files (tasks/task-name.yaml).

```yaml
id: basic-usage-001    # Required: unique ID
name: Basic Usage      # Required: display name
description: "..."     # Optional: full description

tags:                  # Optional: for filtering
  - basic
  - happy-path

inputs:                # Required: test inputs
  prompt: "Your instruction"
  files:
    - path: sample.py

expected:              # Required: validation rules
  output_contains: ["function"]
  behavior:
    max_tool_calls: 5
```

### id

Type: string
Required: yes
Unique task identifier within the eval suite.
```yaml
id: basic-usage-001
```

### name

Type: string
Required: yes

Human-readable task name.

```yaml
name: "Basic Usage - Python Function"
```

### description
Type: string
Required: no

Full description of what the task tests.

```yaml
description: "Test that the agent explains a simple Python function correctly"
```

### tags

Type: array of strings
Required: no
Tags for filtering and categorization.
```yaml
tags:
  - basic
  - happy-path
  - python
```

Usage:

```sh
waza run eval.yaml --tags "basic"
waza run eval.yaml --tags "edge-case"
```

### skill_directories (optional)
Override the eval-level skill_directories for a specific task. Paths are resolved relative to the eval YAML directory.

```yaml
skill_directories:
  - ./skills/custom-task-skills
```

When specified on a task, this replaces (not merges with) the eval-level skill_directories.
## inputs Section

Test inputs passed to the agent.

```yaml
inputs:
  prompt: "Your instruction here"
  files:
    - path: sample.py
    - path: nested/module.py
    - content: |
        def hello():
            print("Hi")
```

### prompt
Type: string
Required: yes (unless prompt_file is used)

Instruction text sent to the agent. Supports templating:

```yaml
inputs:
  prompt: |
    Explain this Python code:
    {{fixture:sample.py}}
```
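The {{fixture:...}} placeholder implies a substitution pass before the prompt is sent. A rough sketch of that expansion, assuming each placeholder maps to the contents of a fixture file (waza's actual resolution rules are not shown here):

```python
import re

def expand_fixtures(prompt: str, fixtures: dict) -> str:
    """Replace {{fixture:NAME}} placeholders with fixture contents."""
    def substitute(match):
        return fixtures[match.group(1)]
    return re.sub(r"\{\{fixture:([^}]+)\}\}", substitute, prompt)

prompt = "Explain this Python code:\n{{fixture:sample.py}}"
expanded = expand_fixtures(
    prompt, {"sample.py": "def add(a, b):\n    return a + b"}
)
```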
### prompt_file

Type: string
Required: no (alternative to prompt)

Path to a file containing the prompt text, resolved relative to the task YAML file's directory.

Use this when prompts are long or shared across tasks. You must specify either prompt or prompt_file, but not both.

```yaml
inputs:
  prompt_file: prompts/review-instructions.md
```

### follow_up_prompts
Type: array of strings
Required: no

Sequential follow-up prompts sent after the initial prompt. Each follow-up reuses the same session and workspace directory, preserving conversation history and file changes across turns.

```yaml
inputs:
  prompt: "Create a helper function"
  follow_up_prompts:
    - "Add error handling"
    - "Write tests"
```

Graders evaluate only the final state after all prompts complete. If any follow-up fails, remaining prompts are skipped and the run is marked as an error.
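The multi-turn semantics above can be sketched as a loop over one shared session, short-circuiting on the first error (the session object and its send method are hypothetical stand-ins for a real agent session):

```python
def run_turns(session, prompt, follow_up_prompts):
    """Send the initial prompt, then each follow-up in the same session.

    Returns (final_output, status) where status is "ok" or "error".
    """
    output = None
    for turn in [prompt, *follow_up_prompts]:
        try:
            output = session.send(turn)  # same history and workspace each turn
        except Exception:
            return output, "error"       # remaining follow-ups are skipped
    return output, "ok"                  # graders see only this final state

class EchoSession:
    """Toy session that records turns; stands in for a real agent session."""
    def __init__(self):
        self.history = []
    def send(self, text):
        self.history.append(text)
        return f"done: {text}"

session = EchoSession()
final, status = run_turns(session, "Create a helper function",
                          ["Add error handling", "Write tests"])
```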
### files

Type: array
Required: no

Files to include in the task context.

```yaml
inputs:
  files:
    - path: sample.py           # Reference fixture
    - content: "def foo(): ..."  # Or inline content
```

## expected Section
Validation rules and constraints.

```yaml
expected:
  output_contains: ["function", "parameter"]
  output_not_contains: ["error"]
  output_contains_any: ["recursion", "iteration", "loop"]
  outcomes:
    - type: task_completed
  behavior:
    max_tool_calls: 5
    max_tokens: 4096
```

### output_contains
Type: array of strings

Strings that must appear in output.

```yaml
expected:
  output_contains:
    - "function"
    - "parameter"
    - "return"
```

### output_not_contains
Type: array of strings

Strings that must NOT appear.

```yaml
expected:
  output_not_contains:
    - "error"
    - "failed"
```

### output_contains_any
Type: array of strings

At least one of these strings must appear in the output (OR logic). Useful when an agent may express a concept in different ways. All checks are case-insensitive.

```yaml
expected:
  output_contains_any:
    - "recursion"
    - "iteration"
    - "loop"
```
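Taken together, output_contains (AND), output_not_contains (NOT), and output_contains_any (OR) behave like the sketch below. The docs state only that the any-check is case-insensitive, so this sketch applies case-folding only there; treat the rest as an assumption, not waza's exact implementation:

```python
def check_output(output, contains=(), not_contains=(), contains_any=()):
    """Apply the three substring checks; True only if all of them pass."""
    if any(s not in output for s in contains):          # AND: every string present
        return False
    if any(s in output for s in not_contains):          # NOT: none present
        return False
    lowered = output.lower()                            # OR check is case-insensitive
    if contains_any and not any(s.lower() in lowered for s in contains_any):
        return False
    return True

ok = check_output(
    "This function uses Recursion on each parameter.",
    contains=["function", "parameter"],
    not_contains=["error"],
    contains_any=["recursion", "iteration", "loop"],
)
```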
### matches

Type: array of strings

Regex patterns to match.

```yaml
expected:
  matches:
    - "returns\\s+.*value"
    - "def\\s+\\w+\\("
```

### outcomes
Type: array of objects

Expected execution outcomes.

```yaml
expected:
  outcomes:
    - type: task_completed
    - type: tool_called
      tool_name: code_analyzer
```

### behavior
Type: object

Behavioral constraints on agent execution. Each constraint that is set contributes to the behavior efficiency score.

| Field | Type | Description |
|---|---|---|
| max_tool_calls | int | Maximum tool invocations allowed |
| max_iterations | int | Maximum conversation rounds (turns) |
| max_tokens | int | Maximum tokens in the response |
| max_response_time_ms | int | Maximum wall-clock execution time in milliseconds |
| required_tools | string[] | Tools the agent must use |
| forbidden_tools | string[] | Tools the agent must NOT use |

```yaml
expected:
  behavior:
    max_tool_calls: 5
    max_iterations: 10
    max_tokens: 4096
    max_response_time_ms: 30000   # 30 seconds
    required_tools:
      - grep
    forbidden_tools:
      - rm
```
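A sketch of how these constraints might be checked against an execution trace (the trace dict and its field names are hypothetical, chosen to mirror the table above):

```python
def check_behavior(trace, constraints):
    """Return the list of violated constraint names for one execution trace.

    trace: dict with "tool_calls" (list of tool names), "iterations",
    "tokens", and "response_time_ms".
    """
    violations = []
    calls = trace["tool_calls"]
    for key, observed in [("max_tool_calls", len(calls)),
                          ("max_iterations", trace["iterations"]),
                          ("max_tokens", trace["tokens"]),
                          ("max_response_time_ms", trace["response_time_ms"])]:
        if key in constraints and observed > constraints[key]:
            violations.append(key)
    for tool in constraints.get("required_tools", []):
        if tool not in calls:
            violations.append(f"required_tools:{tool}")
    for tool in constraints.get("forbidden_tools", []):
        if tool in calls:
            violations.append(f"forbidden_tools:{tool}")
    return violations

violations = check_behavior(
    {"tool_calls": ["grep", "rm"], "iterations": 3,
     "tokens": 1200, "response_time_ms": 900},
    {"max_tool_calls": 5, "required_tools": ["grep"], "forbidden_tools": ["rm"]},
)
# rm was called, so only the forbidden_tools constraint is violated
```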
## .waza.yaml Configuration

Optional project-level configuration file.

```yaml
# Root of waza project
paths:
  skills: skills
  evals: evals
  results: results

# Model defaults
defaults:
  model: claude-sonnet-4.6
  timeout: 300
  workers: 4

# Cache settings
cache:
  enabled: false
  dir: .waza-cache

# Token budget configuration
tokens:
  warningThreshold: 2500
  fallbackLimit: 1000
  limits:
    defaults:
      "SKILL.md": 500
      "references/**/*.md": 1000
      "*.md": 2000
    overrides:
      "README.md": 3000

# Cloud storage (optional)
storage:
  provider: azure-blob
  accountName: "myteamwaza"
  containerName: "waza-results"
  enabled: true
```

## tokens Section
Per-file token budget configuration used by `waza tokens check` and `waza tokens suggest`.
See the Token Limits guide for full documentation including pattern matching rules and migration from .token-limits.json.
Priority order: .waza.yaml tokens.limits → .token-limits.json (legacy, emits deprecation warning) → built-in defaults.
| Field | Type | Default | Description |
|---|---|---|---|
| warningThreshold | integer | 2500 | Token count at which a soft warning is shown |
| fallbackLimit | integer | 1000 | Limit applied to files that match no pattern |
| limits.defaults | map | (built-in) | Glob patterns → token limits |
| limits.overrides | map | {} | Exact file paths → token limits (take precedence over defaults) |

Both limits.defaults and limits.overrides are optional. When tokens.limits is present in .waza.yaml, .token-limits.json is not consulted.
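The resolution order can be sketched as a three-step lookup: exact override first, then the first matching defaults glob, then fallbackLimit. This is an illustration only; the real precedence rules among multiple matching globs are described in the Token Limits guide, and `fnmatch` glob semantics (where `*` can cross `/`) may differ from waza's:

```python
from fnmatch import fnmatch

def token_limit(path, limits, fallback=1000):
    """Resolve the token limit for a file path.

    limits: {"defaults": {glob: limit}, "overrides": {exact_path: limit}}
    fallback: the fallbackLimit applied when nothing matches.
    """
    overrides = limits.get("overrides", {})
    if path in overrides:                       # exact path wins
        return overrides[path]
    for pattern, limit in limits.get("defaults", {}).items():
        if fnmatch(path, pattern):              # first matching glob
            return limit
    return fallback                             # no pattern matched

limits = {
    "defaults": {"SKILL.md": 500, "references/**/*.md": 1000, "*.md": 2000},
    "overrides": {"README.md": 3000},
}
assert token_limit("README.md", limits) == 3000  # override beats "*.md"
assert token_limit("SKILL.md", limits) == 500    # defaults glob
assert token_limit("script.py", limits) == 1000  # falls back to fallbackLimit
```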
## storage Section

Configuration for uploading evaluation results to cloud storage.
### provider

Type: string
Values: azure-blob

The cloud provider to use for result storage.

```yaml
storage:
  provider: azure-blob
```

### accountName
Type: string
Required: yes (when storage: is configured)

The Azure Storage account name. Results are uploaded to blob storage in this account.

```yaml
storage:
  accountName: "myteamwaza"
```

### containerName
Type: string
Default: waza-results

The blob container where results are stored.

```yaml
storage:
  containerName: "waza-results"
```

### enabled
Type: boolean
Default: true (when storage: is configured)

Enable or disable automatic result uploads. When false, results are only saved locally.

```yaml
storage:
  enabled: true
```

## Example Complete eval.yaml
```yaml
name: code-explainer-eval
description: "Evaluation suite for code-explainer skill"
skill: code-explainer
version: "1.0"

config:
  trials_per_task: 1
  timeout_seconds: 300
  parallel: false
  workers: 4
  model: claude-sonnet-4.6
  executor: copilot-sdk

graders:
  - type: text
    name: explains_concepts
    config:
      regex_match:
        - "function"
        - "parameter"
        - "return"

  - type: code
    name: minimum_length
    config:
      assertions:
        - "len(output) > 200"

  - type: tool_calls
    name: reasonable_calls
    config:
      max_calls: 5

tasks:
  - "tasks/*.yaml"
```

## JSON Schema (programmatic access)
See schemas/config.schema.json in the repository for the complete JSON Schema.

```sh
# Validate config
jq . schemas/config.schema.json
```
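For programmatic validation you would normally feed the schema file to a JSON Schema validator (e.g. the Python jsonschema package). As a dependency-free illustration of the idea, even a minimal required-field check catches the most common mistake; the required-field list below is an assumption inferred from this page, not read from the real schema file:

```python
import json

# Assumed subset of what schemas/config.schema.json requires at the top level.
REQUIRED_TOP_LEVEL = ["name", "description", "skill"]

def check_required(config: dict):
    """Return the list of missing required top-level fields."""
    return [field for field in REQUIRED_TOP_LEVEL if field not in config]

config = json.loads('{"name": "code-explainer-eval", "skill": "code-explainer"}')
missing = check_required(config)
# "description" is required here but absent from the config
```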
## Next Steps

- Writing Eval Specs — Full guide with examples
- CLI Reference — All commands
- GitHub Repository — Source code