# Graders
Graders are the scoring engine behind every waza evaluation. After an agent executes a task, one or more graders inspect the result and produce a verdict.
## How graders work

- The agent runs a task inside an isolated workspace.
- waza collects the output text, transcript (tool calls, events), session digest (token counts, tools used), and the workspace directory (files on disk).
- Each grader receives that context and returns a result:
| Field | Type | Description |
|---|---|---|
| `score` | float | 0.0 – 1.0 (proportion of checks passed) |
| `passed` | bool | Whether the grader considers the task successful |
| `feedback` | string | Human-readable explanation |
| `details` | object | Structured metadata for debugging |
You can attach graders globally (applied to every task) or per-task in your eval YAML. Each grader also accepts an optional weight field that controls its influence on the composite score (see Weighted Scoring below).
## At a glance

waza ships with several built-in grader types. Pick the right one for the job:
| Type | YAML key | What it checks |
|---|---|---|
| Inline Script | code | Python/JS assertion expressions against output |
| Text | text | Text and regex matching against output |
| File | file | File existence and content patterns in workspace |
| Diff | diff | Workspace files vs. expected snapshots or fragments |
| JSON Schema | json_schema | Output validates against a JSON Schema |
| Prompt (LLM-as-judge) | prompt | A second LLM grades the result |
| Behavior | behavior | Agent metrics — tool calls, tokens, duration |
| Action Sequence | action_sequence | Tool call ordering and completeness |
| Skill Invocation | skill_invocation | Which skills were invoked and in what order |
| Program | program | External command (any language) grades via exit code |
| Tool Constraint | tool_constraint | Validate tool usage constraints (expect/reject lists and argument patterns) |
| Tool Calls | tool_calls | Validate required/forbidden tools and call count bounds |
| Trigger | trigger | Heuristic grader for validating whether a prompt should activate a skill |
## Inline Script (code)

Evaluates Python or JavaScript assertion expressions against the execution context. Each assertion is a one-line expression that must evaluate to `True`.

```yaml
- type: code
  name: output_quality
  config:
    language: python  # or "javascript" — default is python
    assertions:
      - "len(output) > 100"
      - "'function' in output.lower()"
      - "len(transcript) > 0"
```

### Context variables
| Variable | Type | Description |
|---|---|---|
| `output` | str | Agent's final text output |
| `outcome` | dict | Structured outcome state |
| `transcript` | list | Full execution transcript events |
| `tool_calls` | list | Tool calls extracted from transcript |
| `errors` | list | Errors from transcript |
| `duration_ms` | int | Execution wall-clock time |
Built-in functions: len, any, all, str, int, float, bool, list, dict, re
Scoring: passed_assertions / total_assertions
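Conceptually, assertion scoring works like the sketch below. This is an illustrative model, not waza's actual implementation; the function name `run_code_grader` and the exact sandboxing are assumptions.

```python
import re

def run_code_grader(assertions, context):
    # Expose only the documented built-ins plus the context variables,
    # then evaluate each one-line assertion expression.
    safe_builtins = {"len": len, "any": any, "all": all, "str": str,
                     "int": int, "float": float, "bool": bool,
                     "list": list, "dict": dict, "re": re}
    passed = 0
    for expr in assertions:
        try:
            if eval(expr, {"__builtins__": {}}, {**safe_builtins, **context}):
                passed += 1
        except Exception:
            pass  # an assertion that raises counts as failed
    # Score is the proportion of passing assertions.
    return passed / len(assertions) if assertions else 0.0
```

With `output` set to 150 characters of filler and a one-event transcript, the three assertions from the example above would score 2/3: the length and transcript checks pass, the substring check fails.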
### JavaScript example

```yaml
- type: code
  name: js_checks
  config:
    language: javascript
    assertions:
      - "output.length > 50"
      - "output.includes('hello')"
```

## Text (text)

Validates the agent output using substring matching and regex patterns. Supports case-insensitive and case-sensitive substring checks, plus regex pattern matching.
```yaml
- type: text
  name: format_checker
  config:
    contains:
      - "deployed to"
      - "Resource group"
    not_contains:
      - "permission denied"
    regex_match:
      - "https?://.+"
    regex_not_match:
      - "(?i)error|failed|exception"
```

| Option | Type | Description |
|---|---|---|
| `contains` | list[str] | Substrings that must appear (case-insensitive) |
| `not_contains` | list[str] | Substrings that must not appear (case-insensitive) |
| `contains_cs` | list[str] | Substrings that must appear (case-sensitive) |
| `not_contains_cs` | list[str] | Substrings that must not appear (case-sensitive) |
| `regex_match` | list[str] | Regex patterns that must match in output |
| `regex_not_match` | list[str] | Regex patterns that must not match |
Scoring: passed_checks / total_checks
### Example: code quality gate

```yaml
- type: text
  name: code_quality
  config:
    contains:
      - "def "
      - "return"
    not_contains:
      - "TODO"
      - "FIXME"
    regex_match:
      - "def \\w+\\(.*\\):"   # Has function definitions
    regex_not_match:
      - "print\\("            # No debug prints
```

## File (file)

Validates file existence and content patterns in the agent's workspace directory. Use when the agent creates, modifies, or should avoid certain files.
```yaml
- type: file
  name: project_structure
  config:
    must_exist:
      - "src/index.ts"
      - "package.json"
      - "tsconfig.json"
    must_not_exist:
      - "node_modules/"
      - ".env"
    content_patterns:
      - path: "package.json"
        must_match:
          - '"name":\\s*"my-app"'
        must_not_match:
          - '"version":\\s*"0\\.0\\.0"'
```

| Option | Type | Description |
|---|---|---|
| `must_exist` | list[str] | Workspace-relative paths that must be present |
| `must_not_exist` | list[str] | Paths that must not be present |
| `content_patterns` | list | Regex checks against file contents (see below) |

Each `content_patterns` entry:
| Field | Type | Description |
|---|---|---|
| `path` | string | Workspace-relative file path |
| `must_match` | list[str] | Regex patterns the file content must match |
| `must_not_match` | list[str] | Regex patterns the file must not match |
## Diff (diff)

Compares post-execution workspace files against expected snapshots or content fragments. Ideal for testing file-editing tasks where you know the expected output.
```yaml
- type: diff
  name: code_edits
  config:
    expected_files:
      - path: "src/main.py"
        contains:
          - "+def new_function():"
          - "+    return 42"
          - "-def old_function():"
      - path: "README.md"
        snapshot: "expected/README.md"
```

Each `expected_files` entry supports:
| Field | Type | Description |
|---|---|---|
| `path` | string | Workspace-relative file path (required) |
| `snapshot` | string | Path to expected file for exact matching |
| `contains` | list[str] | Content fragments to check (see prefix rules) |
### Contains prefix rules

| Prefix | Meaning |
|---|---|
| `+` | Fragment must be present in the file |
| `-` | Fragment must not be present |
| (none) | Fragment must be present (same as `+`) |
Scoring: passed_checks / total_checks
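The prefix rules amount to a simple presence/absence test per fragment. A minimal sketch of that logic (the function name `fragment_ok` is illustrative, and we assume exactly one prefix character is stripped, so indentation after `+` is preserved):

```python
def fragment_ok(file_text: str, fragment: str) -> bool:
    # "-" means the fragment must be absent from the file.
    if fragment.startswith("-"):
        return fragment[1:] not in file_text
    # "+" means present; no prefix is treated the same as "+".
    if fragment.startswith("+"):
        return fragment[1:] in file_text
    return fragment in file_text
```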
```yaml
- type: diff
  name: api_updates
  config:
    expected_files:
      - path: "src/api.py"
        contains:
          - "+from fastapi import FastAPI"
          - "+@app.get('/health')"
          - "-import flask"
```

```yaml
- type: diff
  name: exact_config
  config:
    expected_files:
      - path: "config.json"
        snapshot: "snapshots/expected_config.json"
```

## JSON Schema (json_schema)
Validates that the agent output is valid JSON conforming to a given schema. Supports both inline schemas and schema files.
```yaml
- type: json_schema
  name: api_response
  config:
    schema:
      type: object
      required: [status, data]
      properties:
        status:
          type: string
          enum: [success, error]
        data:
          type: object
```

| Option | Type | Description |
|---|---|---|
| `schema` | object | Inline JSON Schema definition |
| `schema_file` | string | Path to a .json schema file |
One of `schema` or `schema_file` is required.
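To see what this grader is checking, here is a toy sketch of the two failure modes it catches: output that is not valid JSON, and JSON that violates the schema. This stdlib-only version only checks the top-level `type` and `required` keywords; a real grader would apply the full JSON Schema spec (for example via the `jsonschema` library).

```python
import json

def output_matches(output: str, schema: dict) -> bool:
    # Failure mode 1: output is not parseable JSON at all.
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    # Failure mode 2: parsed JSON violates the schema.
    if schema.get("type") == "object" and not isinstance(data, dict):
        return False
    return all(key in data for key in schema.get("required", []))
```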
### Example: schema file

```yaml
- type: json_schema
  name: validate_manifest
  config:
    schema_file: "schemas/manifest.schema.json"
```

## Prompt (LLM-as-judge)
Uses a second LLM to evaluate the agent's work. The judge LLM calls the `set_waza_grade_pass` or `set_waza_grade_fail` tool to render its verdict. This is the most flexible grader — it can assess quality, correctness, style, or anything you can describe in natural language.
```yaml
- type: prompt
  name: quality_check
  config:
    prompt: |
      Review the agent's response. Check that the explanation is:
      1. Technically accurate
      2. Easy to understand
      3. Includes code examples

      If all criteria are met, call set_waza_grade_pass.
      Otherwise, call set_waza_grade_fail with your reasoning.
    model: "gpt-4o-mini"
```

| Option | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | (required) | Instructions for the judge LLM |
| `model` | string | (required) | Model to use for judging |
| `continue_session` | bool | false | Resume the agent's session (judge sees full context) |
| `mode` | string | independent | Judging mode: `independent` (standard) or `pairwise` (compare two, requires `--baseline`) |
### How it works

- waza starts a new Copilot session (or resumes the agent's session if `continue_session: true`).
- The judge receives your prompt plus two tool definitions: `set_waza_grade_pass` and `set_waza_grade_fail`.
- The judge calls one of the tools. If it calls `set_waza_grade_pass`, the score is `1.0`; if `set_waza_grade_fail`, the score is `0.0`.
### Example: file review with continue_session

```yaml
- type: prompt
  name: file_review
  config:
    prompt: |
      Check that the files on disk are properly updated.
      Verify the code compiles and follows best practices.
      If correct, call set_waza_grade_pass.
      If not, call set_waza_grade_fail with your reasoning.
    model: "claude-sonnet-4.5"
    continue_session: true
```

## Behavior
Validates agent behavior metrics — how many tool calls were made, token consumption, required/forbidden tools, and execution duration. Use this to enforce efficiency and safety guardrails.
```yaml
- type: behavior
  name: efficiency_check
  config:
    max_tool_calls: 15
    max_tokens: 50000
    max_duration_ms: 60000
    required_tools:
      - bash
      - edit
    forbidden_tools:
      - rm
      - sudo
```

| Option | Type | Description |
|---|---|---|
| `max_tool_calls` | int | Maximum allowed tool calls (0 = no limit) |
| `max_tokens` | int | Maximum total token usage (0 = no limit) |
| `max_duration_ms` | int | Maximum execution time in ms (0 = no limit) |
| `required_tools` | list[str] | Tool names that must be used |
| `forbidden_tools` | list[str] | Tool names that must not be used |
At least one option must be configured. Each configured rule counts as one check.
Scoring: passed_checks / total_checks
## Action Sequence (action_sequence)

Validates the sequence of tool calls the agent made against an expected action path. Supports three matching modes for different levels of strictness.
```yaml
- type: action_sequence
  name: deploy_workflow
  config:
    matching_mode: in_order_match
    expected_actions:
      - bash
      - edit
      - bash
      - git
```

| Option | Type | Description |
|---|---|---|
| `expected_actions` | list[str] | The expected tool call sequence |
| `matching_mode` | string | How to compare actual vs. expected (see below) |
### Matching modes

| Mode | Description |
|---|---|
| `exact_match` | Actual tool calls must exactly match the expected list (same tools, same order, same count) |
| `in_order_match` | Expected tools must appear in order, but extra tools between them are allowed |
| `any_order_match` | All expected tools must appear, but order doesn't matter |
Scoring: F1 score computed from precision (correct calls / total actual) and recall (matched / total expected).
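The F1 computation can be sketched as follows for `in_order_match`. This is an assumed model of the scoring, not waza's exact code: expected tools are matched greedily in order against the actual calls, then precision and recall are combined as the usual harmonic mean.

```python
def sequence_f1(expected, actual):
    # Greedily match expected tools in order against the actual call list.
    matched = 0
    for tool in actual:
        if matched < len(expected) and tool == expected[matched]:
            matched += 1
    # Precision: matched calls / total actual; recall: matched / total expected.
    precision = matched / len(actual) if actual else 0.0
    recall = matched / len(expected) if expected else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

An exact match scores 1.0; one extra `ls` call between `bash` and `edit` drops precision to 2/3 while recall stays at 1.0, giving F1 = 0.8.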
## Skill Invocation (skill_invocation)

Validates which Copilot skills the agent invoked during a session. Useful for multi-skill orchestration testing — verifying the agent delegates to the right skills.
```yaml
- type: skill_invocation
  name: routing_check
  config:
    required_skills:
      - azure-prepare
      - azure-deploy
    mode: in_order
    allow_extra: true
```

| Option | Type | Default | Description |
|---|---|---|---|
| `required_skills` | list[str] | (required) | Skills that must be invoked |
| `mode` | string | (required) | Matching mode: `exact_match`, `in_order`, or `any_order` |
| `allow_extra` | bool | true | Whether extra skill invocations are allowed without penalty |
When allow_extra: false, extra invocations beyond the required list reduce the score by up to 60%.
Scoring: F1 score (harmonic mean of precision and recall), with optional penalty for extra invocations.
## Tool Constraint (tool_constraint)

Validates which tools an agent should or shouldn't use, and enforces turn and token limits. Reads from the session digest to check tool usage, turn counts, and total token consumption.
```yaml
- type: tool_constraint
  name: guardrails
  config:
    expect_tools:
      - bash
      - edit
    reject_tools:
      - rm
      - sudo
    max_turns: 10
    max_tokens: 50000
```

| Option | Type | Default | Description |
|---|---|---|---|
| `expect_tools` | list[str] | [] | Tool names that must appear in the session |
| `reject_tools` | list[str] | [] | Tool names that must not appear |
| `max_turns` | int | 0 | Maximum conversation turns (0 = no limit) |
| `max_tokens` | int | 0 | Maximum total token usage (0 = no limit) |
At least one constraint must be configured. Each configured rule counts as one check.
Scoring: passed_checks / total_checks
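The check counting described above can be sketched like this. It is an illustrative model (the function name is hypothetical, and we assume each listed tool counts as its own check):

```python
def grade_tool_constraint(config, used_tools, turns, total_tokens):
    checks = []
    # Each expected tool is one check: it must appear in the session.
    for tool in config.get("expect_tools", []):
        checks.append(tool in used_tools)
    # Each rejected tool is one check: it must be absent.
    for tool in config.get("reject_tools", []):
        checks.append(tool not in used_tools)
    # Limits of 0 mean "no limit" and add no check.
    if config.get("max_turns", 0) > 0:
        checks.append(turns <= config["max_turns"])
    if config.get("max_tokens", 0) > 0:
        checks.append(total_tokens <= config["max_tokens"])
    score = sum(checks) / len(checks)
    return score, all(checks)
```

For example, a session that used `bash` and `edit`, avoided `rm`, but ran 12 turns against a `max_turns: 10` limit would pass 3 of 4 checks for a score of 0.75 and an overall fail.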
### Example: safety guardrails

```yaml
- type: tool_constraint
  name: safety_check
  config:
    reject_tools:
      - rm
      - sudo
      - kill
    max_turns: 15
    max_tokens: 100000
```

### Example: required workflow tools
```yaml
- type: tool_constraint
  name: workflow_tools
  config:
    expect_tools:
      - bash
      - edit
      - grep
```

## Tool Calls (tool_calls)
Validates which tools an agent called during execution. Supports required tools, forbidden tools, and call-count bounds — all in one grader. Each configured constraint counts as one check; partial credit is awarded.
```yaml
- type: tool_calls
  name: tool_usage
  config:
    required_tools:
      - bash
      - edit
    forbidden_tools:
      - rm
      - sudo
    min_calls: 2
    max_calls: 20
```

| Option | Type | Default | Description |
|---|---|---|---|
| `required_tools` | list[str] | [] | Tool names that must appear in the session |
| `forbidden_tools` | list[str] | [] | Tool names that must not appear |
| `min_calls` | int | 0 | Minimum total tool calls required (0 = no min) |
| `max_calls` | int | 0 | Maximum total tool calls allowed (0 = no limit) |
At least one constraint must be configured. When both `min_calls` and `max_calls` are set, `min_calls` must be ≤ `max_calls`.
Scoring: passed_checks / total_checks
### Example: enforce safe tool usage

```yaml
- type: tool_calls
  name: safety_check
  config:
    forbidden_tools:
      - rm
      - sudo
    max_calls: 30
```

### Example: verify minimum workflow
```yaml
- type: tool_calls
  name: workflow_check
  config:
    required_tools:
      - bash
      - edit
    min_calls: 3
```

## Program
Runs any external command to grade the agent output. The agent output is passed via stdin, and the workspace directory is available in the `WAZA_WORKSPACE_DIR` environment variable. Exit code 0 means pass (score 1.0); non-zero means fail (score 0.0).
```yaml
- type: program
  name: lint_check
  config:
    command: "python3"
    args: ["scripts/grade.py"]
    timeout: 60
```

| Option | Type | Default | Description |
|---|---|---|---|
| `command` | string | (required) | Program to execute |
| `args` | list[str] | [] | Arguments passed to the program |
| `timeout` | int | 30 | Max execution time in seconds |
### Example: shell script grader

```yaml
- type: program
  name: build_test
  config:
    command: "bash"
    args: ["-c", "cd $WAZA_WORKSPACE_DIR && npm test"]
    timeout: 120
```

### Example: custom Python grader
Section titled “Example: custom Python grader”#!/usr/bin/env python3"""scripts/grade.py — reads agent output from stdin, exits 0 or 1."""import sys, json, os
output = sys.stdin.read()workspace = os.environ.get("WAZA_WORKSPACE_DIR", "")
# Check that a required file was createdif os.path.exists(os.path.join(workspace, "result.json")): print("✓ result.json created") sys.exit(0)else: print("✗ result.json missing") sys.exit(1)Trigger
Heuristic grader for validating whether a prompt should activate a skill.
```yaml
- type: trigger
  name: deploy_trigger
  config:
    skill_path: skills/azure-deploy/SKILL.md
    mode: positive
    threshold: 0.6
```

| Option | Type | Description |
|---|---|---|
| `skill_path` | string | (required) Path to the SKILL.md file or the skill directory containing it |
| `mode` | string | Either `positive` or `negative` |
| `threshold` | number | Score threshold between 0.0 and 1.0. Default is 0.6 |
For more information, see the triggers grader documentation.
## Using graders in eval YAML

### Global graders

Defined at the top level of `eval.yaml`, applied to every task:
```yaml
graders:
  - type: text
    name: no_errors
    config:
      regex_not_match:
        - "(?i)fatal error|crashed|exception occurred"

  - type: code
    name: has_output
    config:
      assertions:
        - "len(output) > 10"

tasks:
  - task_files: ["tasks/*.yaml"]
```

### Per-task graders (by reference)
Section titled “Per-task graders (by reference)”Define graders globally, then reference them by name in individual tasks:
```yaml
graders:
  - type: text
    name: format_check
    config:
      regex_match: ["^[A-Z]"]

  - type: code
    name: length_check
    config:
      assertions:
        - "len(output) > 100"

tasks:
  - id: task-001
    inputs:
      prompt: "Explain this code"
    expected:
      graders:
        - format_check
        - length_check
```

### Per-task graders (inline)
Section titled “Per-task graders (inline)”Define graders directly inside a task:
```yaml
tasks:
  - id: task-001
    inputs:
      prompt: "Create a REST API"
    expected:
      graders:
        - type: file
          name: api_files
          config:
            must_exist:
              - "src/api.py"
              - "requirements.txt"
        - type: diff
          name: api_content
          config:
            expected_files:
              - path: "src/api.py"
                contains:
                  - "+from fastapi import FastAPI"
```

## Weighted Scoring
By default every grader counts equally toward the composite score. Add a `weight` field to shift importance:
```yaml
graders:
  - type: text
    name: critical_check
    weight: 3.0   # Counts 3×
    config:
      regex_match: ["deployed"]

  - type: text
    name: nice_to_have
    weight: 0.5   # Counts 0.5×
    config:
      contains: [summary]

  - type: code
    name: basic_length   # weight omitted → defaults to 1.0
    config:
      assertions:
        - "len(output) > 50"
```

| Option | Type | Default | Description |
|---|---|---|---|
| `weight` | float | 1.0 | Relative importance of this grader in the composite score |
Formula: (score₁ × weight₁ + score₂ × weight₂ + …) / (weight₁ + weight₂ + …)
With the config above and scores of 1.0, 0.0, and 1.0, the composite score is (1.0×3 + 0.0×0.5 + 1.0×1) / (3+0.5+1) = 0.89.
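In code, the weighted-average formula is a one-liner; the helper below is an illustrative sketch (the function name is hypothetical), reproducing the worked example above.

```python
def composite_score(results):
    # results: list of (score, weight) pairs, one per grader.
    total_weight = sum(weight for _, weight in results)
    return sum(score * weight for score, weight in results) / total_weight

# The worked example: scores 1.0, 0.0, 1.0 with weights 3.0, 0.5, 1.0.
composite = composite_score([(1.0, 3.0), (0.0, 0.5), (1.0, 1.0)])
```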
## Combining graders

You can stack multiple graders on a single task. All graders run independently and each produces its own score. A task passes when all graders pass.
```yaml
graders:
  # 1. Output mentions key concepts
  - type: text
    name: concepts
    config:
      contains: [authentication, JWT, middleware]

  # 2. No error patterns
  - type: text
    name: no_errors
    config:
      regex_not_match: ["(?i)error|exception"]

  # 3. Required files exist
  - type: file
    name: deliverables
    config:
      must_exist: ["src/auth.ts", "src/middleware.ts"]

  # 4. Agent was efficient
  - type: behavior
    name: efficiency
    config:
      max_tool_calls: 20
      max_tokens: 40000

  # 5. LLM judge confirms quality
  - type: prompt
    name: quality
    config:
      prompt: |
        Review the implementation for security best practices.
        Call set_waza_grade_pass if secure, set_waza_grade_fail if not.
      model: gpt-4o-mini
```

## Real-world examples
### Skill trigger accuracy

Verify a skill activates on the right prompts and stays silent on the wrong ones:
```yaml
tasks:
  - id: should-trigger
    inputs:
      prompt: "Deploy my app to Azure"
    expected:
      graders:
        - type: text
          name: azure_response
          config:
            contains: [azure, deploy, resource]

  - id: should-not-trigger
    inputs:
      prompt: "What's the weather today?"
    expected:
      graders:
        - type: text
          name: no_azure
          config:
            not_contains: [azure, deploy, bicep]
```

### Code editing task
Test that the agent correctly modifies source files:
```yaml
graders:
  - type: file
    name: files_created
    config:
      must_exist:
        - "src/utils.ts"
        - "tests/utils.test.ts"

  - type: diff
    name: correct_edits
    config:
      expected_files:
        - path: "src/utils.ts"
          contains:
            - "+export function formatDate"
            - "+export function parseConfig"
        - path: "tests/utils.test.ts"
          contains:
            - "+describe('formatDate'"

  - type: program
    name: tests_pass
    config:
      command: bash
      args: ["-c", "cd $WAZA_WORKSPACE_DIR && npm test --silent"]
      timeout: 60
```

### Multi-skill orchestration
Verify the agent invokes the right skills in the right order:
```yaml
graders:
  - type: skill_invocation
    name: correct_workflow
    config:
      required_skills:
        - brainstorming
        - azure-prepare
        - azure-deploy
      mode: in_order
      allow_extra: false

  - type: action_sequence
    name: tool_usage
    config:
      matching_mode: in_order_match
      expected_actions:
        - bash
        - create
        - edit
        - bash
```

## Best practices
- **Start simple** — Begin with `text` graders, then add stricter graders as you identify failure modes.
- **Layer your checks** — Combine output graders (`text`, `code`) with workspace graders (`file`, `diff`) and behavior graders for comprehensive coverage.
- **Use descriptive names** — `checks_auth_flow` beats `grader1`. Names appear in the dashboard and CLI output.
- **Use `prompt` for subjective quality** — When you can't express the check as a pattern or assertion, let an LLM judge it.
- **Set behavior budgets** — Use the `behavior` grader to catch runaway agents that burn too many tokens or tool calls.
- **Test graders in isolation** — Run a single task with `waza run eval.yaml --task my-task -v` to verify graders before running the full suite.
- **Use `program` as an escape hatch** — When you need full programmatic control, write a script in any language and use the `program` grader.
## Next steps

- Writing Eval Specs — Task and fixture configuration
- Web Dashboard — Visualize grader results
- CLI Reference — All commands and flags