YAML Schema

Complete reference for the YAML schema used in waza evaluations.

Main evaluation configuration file.

name: code-explainer-eval # Required: eval suite name
description: "..." # Required: what this eval tests
skill: code-explainer # Required: skill name
version: "1.0" # Optional: version number
config:
  trials_per_task: 1 # Runs per task
  timeout_seconds: 300 # Task timeout
  parallel: false # Concurrent execution
  workers: 4 # Parallel workers
  model: claude-sonnet-4.6 # Default model
  executor: mock # mock or copilot-sdk
graders: # Validation rules
  - type: text
    name: checks_logic
    config:
      regex_match: "(?i)(function)"
tasks: # Test cases
  - "tasks/*.yaml" # From files
  - id: inline-001 # Or inline
    name: Inline Task
    inputs:
      prompt: "..."

name

Type: string
Required: yes

The unique name of this evaluation suite. Used in reports and result files.

name: code-explainer-eval

description

Type: string
Required: yes

Describes what this evaluation tests. Appears in reports.

description: "Evaluates agent's ability to explain Python code"

skill

Type: string
Required: yes

The skill being evaluated.

skill: code-explainer

version

Type: string
Required: no
Default: (empty)

Version number of the evaluation suite.

version: "1.0"

trials_per_task

Type: integer
Default: 1

How many times each task is run. Use a value greater than 1 to average out run-to-run variation in agent output.

config:
  trials_per_task: 3
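
As a rough illustration of why several trials help, the sketch below computes a per-task pass rate from repeated runs. This reference does not say how waza aggregates trials, so `pass_rate` and the sample outcomes are hypothetical.

```python
from statistics import mean


def pass_rate(trial_results: list[bool]) -> float:
    """Return the fraction of trials that passed for a single task."""
    return mean(1.0 if passed else 0.0 for passed in trial_results)


# Hypothetical outcomes for one task run with trials_per_task: 3
outcomes = [True, False, True]
rate = pass_rate(outcomes)
print(f"pass rate: {rate:.2f}")
```

With a single trial the rate can only be 0 or 1, which hides flaky behavior that repeated trials expose.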

timeout_seconds

Type: integer
Default: 300

Maximum seconds per task before timeout.

config:
  timeout_seconds: 300 # 5 minutes

Common values:

  • 60 — Quick validation tasks
  • 300 — Standard code analysis
  • 600 — Complex multi-file tasks

parallel

Type: boolean
Default: false

Run tasks concurrently instead of sequentially.

config:
  parallel: true

workers

Type: integer
Default: 4

Number of concurrent workers when parallel: true.

config:
  parallel: true
  workers: 8
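
The relationship between parallel and workers can be pictured with a small worker-pool sketch. This is not waza's implementation; `run_task` and `run_all` are hypothetical stand-ins that only show how a worker count bounds concurrency.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(task_id: str) -> str:
    """Placeholder for real task execution (hypothetical)."""
    return f"{task_id}: done"


def run_all(task_ids: list[str], parallel: bool = False, workers: int = 4) -> list[str]:
    """Run tasks one by one, or on a pool of at most `workers` threads."""
    if not parallel:
        return [run_task(task_id) for task_id in task_ids]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, whatever the finish order
        return list(pool.map(run_task, task_ids))


results = run_all(["task-001", "task-002", "task-003"], parallel=True, workers=8)
print(results)
```

In this sketch the workers value has no effect unless parallel is true, which mirrors the description above.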

model

Type: string
Default: (empty)

Default LLM model. Override it with the --model flag.

config:
  model: claude-sonnet-4.6

executor

Type: string
Default: mock
Options: mock, copilot-sdk

Execution engine:

  • mock — Local testing (no API calls)
  • copilot-sdk — Real LLM execution

config:
  executor: copilot-sdk

graders

List of validation rules, defined once for the suite and shared across tasks.

graders:
  - type: text
    name: pattern_check
    config:
      regex_match: ["success"]
  - type: code
    name: logic_check
    config:
      assertions:
        - "len(output) > 0"

| Field | Type | Description |
|-------|------|-------------|
| type | string | Grader type: code, regex, keyword, file, diff, json_schema, prompt, behavior, action_sequence, skill_invocation, program |
| name | string | Unique grader name (used to reference in tasks) |
| weight | float | Relative importance in composite scoring (default: 1.0) |
| config | object | Type-specific configuration |

See Validators & Graders for complete documentation.
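
To make the weight field concrete, here is a minimal sketch that assumes composite scoring is a weighted average of grader scores. The reference only calls weight a "relative importance", so the exact formula waza uses may differ, and `composite_score` is a hypothetical helper.

```python
def composite_score(results: list[tuple[float, float]]) -> float:
    """Weighted average of (score, weight) pairs; assumed formula, not waza's."""
    total_weight = sum(weight for _, weight in results)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in results) / total_weight


# One grader at the default weight 1.0 and one weighted three times as heavily
score = composite_score([(1.0, 1.0), (0.5, 3.0)])
print(score)
```

Under this assumption only the ratio between weights matters, so weights 1 and 3 behave the same as 2 and 6.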

tasks

List of test cases to run.

tasks:
  - "tasks/*.yaml" # Load from files
  - "tasks/basic.yaml" # Specific file
  - id: inline-001 # Inline task
    name: "Test 1"
    inputs:
      prompt: "..."

Load tasks from files using glob patterns or explicit paths:

tasks:
  - "tasks/*.yaml" # All YAML files
  - "tasks/basic/*.yaml" # Subdirectory
  - "tasks/important.yaml" # Single file

File paths are relative to eval.yaml.
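
The path rule can be sketched with pathlib. This assumes patterns are expanded against the directory containing eval.yaml and that files matched twice are loaded once. `resolve_task_files` is a hypothetical helper, not part of waza.

```python
import tempfile
from pathlib import Path


def resolve_task_files(eval_file: Path, patterns: list[str]) -> list[Path]:
    """Expand task patterns relative to the directory holding eval_file."""
    base = eval_file.parent
    found: list[Path] = []
    for pattern in patterns:
        for path in sorted(base.glob(pattern)):
            if path not in found:  # skip files already matched by an earlier pattern
                found.append(path)
    return found


# Demo in a throwaway directory laid out like an eval suite
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tasks").mkdir()
    for filename in ("basic.yaml", "edge.yaml", "notes.txt"):
        (root / "tasks" / filename).write_text("")
    matches = resolve_task_files(root / "eval.yaml", ["tasks/*.yaml", "tasks/basic.yaml"])
    names = [path.name for path in matches]

print(names)
```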

Define tasks directly in eval.yaml:

tasks:
  - id: task-001
    name: "Basic Usage"
    inputs:
      prompt: "Explain this"

Individual task files (tasks/task-name.yaml).

id: basic-usage-001 # Required: unique ID
name: Basic Usage # Required: display name
description: "..." # Optional: full description
tags: # Optional: for filtering
  - basic
  - happy-path
inputs: # Required: test inputs
  prompt: "Your instruction"
  files:
    - path: sample.py
expected: # Required: validation rules
  output_contains: ["function"]
  behavior:
    max_tool_calls: 5

id

Type: string
Required: yes

Unique task identifier within the eval suite.

id: basic-usage-001

name

Type: string
Required: yes

Human-readable task name.

name: "Basic Usage - Python Function"

description

Type: string
Required: no

Full description of what the task tests.

description: "Test that the agent explains a simple Python function correctly"

tags

Type: array of strings
Required: no

Tags for filtering and categorization.

tags:
  - basic
  - happy-path
  - python

Usage:

waza run eval.yaml --tags "basic"
waza run eval.yaml --tags "edge-case"

inputs

Test inputs passed to the agent.

inputs:
  prompt: "Your instruction here"
  files:
    - path: sample.py
    - path: nested/module.py
    - content: |
        def hello():
            print("Hi")

prompt

Type: string
Required: yes

Instruction text sent to the agent. Supports templating:

inputs:
  prompt: |
    Explain this Python code:
    {{fixture:sample.py}}
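
One plausible reading of the templating syntax is that each {{fixture:NAME}} placeholder is replaced by the contents of the named fixture file. The sketch below implements that reading only; the fixtures directory and `render_prompt` are assumptions, and waza's actual behavior may differ.

```python
import re
import tempfile
from pathlib import Path

# Matches placeholders such as {{fixture:sample.py}}
FIXTURE_PATTERN = re.compile(r"\{\{fixture:([^}]+)\}\}")


def render_prompt(prompt: str, fixtures_dir: Path) -> str:
    """Replace each fixture placeholder with the text of the named file."""
    def load(match: re.Match) -> str:
        return (fixtures_dir / match.group(1).strip()).read_text()
    return FIXTURE_PATTERN.sub(load, prompt)


# Demo with a temporary fixture file
with tempfile.TemporaryDirectory() as tmp:
    fixtures = Path(tmp)
    (fixtures / "sample.py").write_text("def hello():\n    print('Hi')\n")
    rendered = render_prompt("Explain this Python code:\n{{fixture:sample.py}}", fixtures)

print(rendered)
```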

files

Type: array
Required: no

Files to include in the task context.

inputs:
  files:
    - path: sample.py # Reference fixture
    - content: "def foo(): ..." # Or inline content

expected

Validation rules and constraints.

expected:
  output_contains: ["function", "parameter"]
  output_excludes: ["error"]
  matches: ["def\\s+\\w+"]
  outcomes:
    - type: task_completed
  behavior:
    max_tool_calls: 5
    max_response_time_ms: 30000
    max_tokens: 4096

output_contains

Type: array of strings

Strings that must appear in the output.

expected:
  output_contains:
    - "function"
    - "parameter"
    - "return"

output_excludes

Type: array of strings

Strings that must NOT appear in the output.

expected:
  output_excludes:
    - "error"
    - "failed"

matches

Type: array of strings

Regex patterns that the output must match.

expected:
  matches:
    - "returns\\s+.*value"
    - "def\\s+\\w+\\("
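
The three text checks above (output_contains, output_excludes, matches) can be sketched together. This assumes case-sensitive substring tests and a regex search anywhere in the output; the reference does not specify either, and `check_expected` is a hypothetical helper.

```python
import re


def check_expected(output: str, expected: dict) -> list[str]:
    """Return a list of failure messages; an empty list means all checks passed."""
    failures = []
    for text in expected.get("output_contains", []):
        if text not in output:
            failures.append(f"missing: {text}")
    for text in expected.get("output_excludes", []):
        if text in output:
            failures.append(f"forbidden: {text}")
    for pattern in expected.get("matches", []):
        if not re.search(pattern, output):
            failures.append(f"no match: {pattern}")
    return failures


expected = {
    "output_contains": ["function"],
    "output_excludes": ["error"],
    "matches": [r"def\s+\w+\("],
}
good = check_expected("The function is declared as def add(a, b).", expected)
bad = check_expected("An error occurred.", expected)
print(good, bad)
```

Note that in YAML double-quoted strings the backslashes are doubled ("def\\s+\\w+\\("), while the Python raw string above uses single backslashes for the same pattern.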

outcomes

Type: array of objects

Expected execution outcomes.

expected:
  outcomes:
    - type: task_completed
    - type: tool_called
      tool_name: code_analyzer

behavior

Type: object

Behavioral constraints.

expected:
  behavior:
    max_tool_calls: 5 # Max number of tool calls
    max_response_time_ms: 30000 # Max execution time in milliseconds
    max_tokens: 4096 # Max tokens in response
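
As an illustration of how such limits might be enforced, the sketch below compares observed run statistics against the configured maximums. `RunStats` and `behavior_violations` are hypothetical names, and treating each limit as an inclusive upper bound is an assumption.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Observed metrics for one task run (hypothetical structure)."""
    tool_calls: int
    response_time_ms: int
    tokens: int


def behavior_violations(stats: RunStats, behavior: dict) -> list[str]:
    """List every configured limit that the observed stats exceed."""
    observed = {
        "max_tool_calls": stats.tool_calls,
        "max_response_time_ms": stats.response_time_ms,
        "max_tokens": stats.tokens,
    }
    return [
        f"{key}: {value} exceeds {behavior[key]}"
        for key, value in observed.items()
        if key in behavior and value > behavior[key]  # unset limits are ignored
    ]


limits = {"max_tool_calls": 5, "max_response_time_ms": 30000, "max_tokens": 4096}
violations = behavior_violations(RunStats(tool_calls=7, response_time_ms=1200, tokens=900), limits)
print(violations)
```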

Optional project-level configuration file.

# Root of waza project
paths:
  skills: skills
  evals: evals
  results: results
# Model defaults
defaults:
  model: claude-sonnet-4.6
  timeout: 300
  workers: 4
# Cache settings
cache:
  enabled: false
  dir: .waza-cache
# Cloud storage (optional)
storage:
  provider: azure-blob
  accountName: "myteamwaza"
  containerName: "waza-results"
  enabled: true

storage

Configuration for uploading evaluation results to cloud storage.

provider

Type: string
Values: azure-blob

The cloud provider to use for result storage.

storage:
  provider: azure-blob

accountName

Type: string
Required: yes (when storage: is configured)

The Azure Storage account name. Results are uploaded to blob storage in this account.

storage:
  accountName: "myteamwaza"

containerName

Type: string
Default: waza-results

The blob container where results are stored.

storage:
  containerName: "waza-results"

enabled

Type: boolean
Default: true (when storage: is configured)

Enable or disable automatic result uploads. When false, results are only saved locally.

storage:
  enabled: true

Complete eval.yaml example:

name: code-explainer-eval
description: "Evaluation suite for code-explainer skill"
skill: code-explainer
version: "1.0"
config:
  trials_per_task: 1
  timeout_seconds: 300
  parallel: false
  workers: 4
  model: claude-sonnet-4.6
  executor: copilot-sdk
graders:
  - type: text
    name: explains_concepts
    config:
      regex_match:
        - "function"
        - "parameter"
        - "return"
  - type: code
    name: minimum_length
    config:
      assertions:
        - "len(output) > 200"
  - type: tool_calls
    name: reasonable_calls
    config:
      max_calls: 5
tasks:
  - "tasks/*.yaml"

See schemas/config.schema.json in the repository for the complete JSON Schema.

# Pretty-print the schema (jq exits with an error if the file is not valid JSON)
jq . schemas/config.schema.json