
# Graders

Graders are the scoring engine behind every waza evaluation. After an agent executes a task, one or more graders inspect the result and produce a verdict.

  1. The agent runs a task inside an isolated workspace.
  2. waza collects the output text, transcript (tool calls, events), session digest (token counts, tools used), and the workspace directory (files on disk).
  3. Each grader receives that context and returns a result:
| Field | Type | Description |
| --- | --- | --- |
| `score` | float | 0.0 – 1.0 (proportion of checks passed) |
| `passed` | bool | Whether the grader considers the task successful |
| `feedback` | string | Human-readable explanation |
| `details` | object | Structured metadata for debugging |
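As an illustrative sketch (the field names come from the table above; the concrete values and serialization are assumptions), a single grader result might look like:

```python
# Hypothetical grader result; field names are from the table above,
# the values are invented for illustration.
result = {
    "score": 0.75,  # 3 of 4 checks passed
    "passed": False,
    "feedback": "3/4 checks passed; tsconfig.json missing",
    "details": {"failed_checks": ["must_exist: tsconfig.json"]},
}
print(result["score"], result["passed"])
```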

You can attach graders globally (applied to every task) or per-task in your eval YAML. Each grader also accepts an optional `weight` field that controls its influence on the composite score (see Weighted Scoring below).


## Built-in grader types

waza ships with several built-in grader types. Pick the right one for the job:

| Type | YAML key | What it checks |
| --- | --- | --- |
| Inline Script | `code` | Python/JS assertion expressions against output |
| Text | `text` | Text and regex matching against output |
| File | `file` | File existence and content patterns in workspace |
| Diff | `diff` | Workspace files vs. expected snapshots or fragments |
| JSON Schema | `json_schema` | Output validates against a JSON Schema |
| Prompt (LLM-as-judge) | `prompt` | A second LLM grades the result |
| Behavior | `behavior` | Agent metrics: tool calls, tokens, duration |
| Action Sequence | `action_sequence` | Tool call ordering and completeness |
| Skill Invocation | `skill_invocation` | Which skills were invoked and in what order |
| Tool Constraint | `tool_constraint` | Tool usage, turn, and token limits from the session digest |
| Program | `program` | External command (any language) grades via exit code |

## Inline Script (`code`)

Evaluates Python or JavaScript assertion expressions against the execution context. Each assertion is a one-liner that must evaluate to `True`.

```yaml
- type: code
  name: output_quality
  config:
    language: python   # or "javascript"; default is python
    assertions:
      - "len(output) > 100"
      - "'function' in output.lower()"
      - "len(transcript) > 0"
```
| Variable | Type | Description |
| --- | --- | --- |
| `output` | str | Agent's final text output |
| `outcome` | dict | Structured outcome state |
| `transcript` | list | Full execution transcript events |
| `tool_calls` | list | Tool calls extracted from transcript |
| `errors` | list | Errors from transcript |
| `duration_ms` | int | Execution wall-clock time |

Built-in functions: `len`, `any`, `all`, `str`, `int`, `float`, `bool`, `list`, `dict`, `re`

Scoring: passed_assertions / total_assertions
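The scoring rule can be sketched like this (illustrative only, not waza's actual implementation; the context values are invented):

```python
# Evaluate each assertion against the grading context; the score is the
# fraction of assertions that hold. Sketch of the documented rule only.
import re

context = {
    "output": "def greet():\n    return 'hello'",
    "transcript": [{"type": "tool_call", "name": "edit"}],
}
builtins = {"len": len, "any": any, "all": all, "str": str, "int": int,
            "float": float, "bool": bool, "list": list, "dict": dict, "re": re}

assertions = ["len(output) > 10", "'def' in output", "len(transcript) > 0"]
passed = sum(bool(eval(a, builtins, dict(context))) for a in assertions)
score = passed / len(assertions)
print(score)  # 1.0
```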

```yaml
- type: code
  name: js_checks
  config:
    language: javascript
    assertions:
      - "output.length > 50"
      - "output.includes('hello')"
```

## Text (`text`)

Validates the agent output using substring matching and regex patterns. Supports case-insensitive and case-sensitive substring checks, plus regex pattern matching.

```yaml
- type: text
  name: format_checker
  config:
    contains:
      - "deployed to"
      - "Resource group"
    not_contains:
      - "permission denied"
    regex_match:
      - "https?://.+"
    regex_not_match:
      - "(?i)error|failed|exception"
```
| Option | Type | Description |
| --- | --- | --- |
| `contains` | list[str] | Substrings that must appear (case-insensitive) |
| `not_contains` | list[str] | Substrings that must not appear (case-insensitive) |
| `contains_cs` | list[str] | Substrings that must appear (case-sensitive) |
| `not_contains_cs` | list[str] | Substrings that must not appear (case-sensitive) |
| `regex_match` | list[str] | Regex patterns that must match in output |
| `regex_not_match` | list[str] | Regex patterns that must not match |

Scoring: passed_checks / total_checks
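A minimal sketch of the check counting, using the `format_checker` config above (the sample output is invented):

```python
# Each configured substring or pattern counts as one check.
import re

output = "App deployed to Resource group rg-prod: https://portal.example.com"
checks = []
checks += [s.lower() in output.lower() for s in ["deployed to", "Resource group"]]   # contains
checks += [s.lower() not in output.lower() for s in ["permission denied"]]           # not_contains
checks += [re.search(p, output) is not None for p in [r"https?://.+"]]               # regex_match
checks += [re.search(p, output) is None for p in [r"(?i)error|failed|exception"]]    # regex_not_match
score = sum(checks) / len(checks)
print(score)  # 1.0
```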

```yaml
- type: text
  name: code_quality
  config:
    contains:
      - "def "
      - "return"
    not_contains:
      - "TODO"
      - "FIXME"
    regex_match:
      - "def \\w+\\(.*\\):"   # Has function definitions
    regex_not_match:
      - "print\\("            # No debug prints
```

## File (`file`)

Validates file existence and content patterns in the agent's workspace directory. Use when the agent creates, modifies, or should avoid certain files.

```yaml
- type: file
  name: project_structure
  config:
    must_exist:
      - "src/index.ts"
      - "package.json"
      - "tsconfig.json"
    must_not_exist:
      - "node_modules/"
      - ".env"
    content_patterns:
      - path: "package.json"
        must_match:
          - '"name":\s*"my-app"'
        must_not_match:
          - '"version":\s*"0\.0\.0"'
```
| Option | Type | Description |
| --- | --- | --- |
| `must_exist` | list[str] | Workspace-relative paths that must be present |
| `must_not_exist` | list[str] | Paths that must not be present |
| `content_patterns` | list | Regex checks against file contents (see below) |

Each `content_patterns` entry:

| Field | Type | Description |
| --- | --- | --- |
| `path` | string | Workspace-relative file path |
| `must_match` | list[str] | Regex patterns the file content must match |
| `must_not_match` | list[str] | Regex patterns the file must not match |
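A runnable sketch of these checks (illustrative; the grader itself presumably walks the real workspace — here we fabricate a tiny one so the example is self-contained):

```python
# Fabricate a throwaway workspace, then count file checks the way the
# option table describes. Paths and contents are invented.
import os, re, tempfile

ws = tempfile.mkdtemp()
os.makedirs(os.path.join(ws, "src"))
open(os.path.join(ws, "src", "index.ts"), "w").close()
with open(os.path.join(ws, "package.json"), "w") as f:
    f.write('{"name": "my-app", "version": "1.2.0"}')

checks = []
for p in ["src/index.ts", "package.json"]:                        # must_exist
    checks.append(os.path.exists(os.path.join(ws, p)))
checks.append(not os.path.exists(os.path.join(ws, ".env")))       # must_not_exist
pkg = open(os.path.join(ws, "package.json")).read()
checks.append(re.search(r'"name":\s*"my-app"', pkg) is not None)  # must_match
checks.append(re.search(r'"version":\s*"0\.0\.0"', pkg) is None)  # must_not_match
score = sum(checks) / len(checks)
print(score)  # 1.0
```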

## Diff (`diff`)

Compares post-execution workspace files against expected snapshots or content fragments. Ideal for testing file-editing tasks where you know the expected output.

```yaml
- type: diff
  name: code_edits
  config:
    expected_files:
      - path: "src/main.py"
        contains:
          - "+def new_function():"
          - "+    return 42"
          - "-def old_function():"
      - path: "README.md"
        snapshot: "expected/README.md"
```

Each `expected_files` entry supports:

| Field | Type | Description |
| --- | --- | --- |
| `path` | string | Workspace-relative file path (required) |
| `snapshot` | string | Path to expected file for exact matching |
| `contains` | list[str] | Content fragments to check (see prefix rules) |

| Prefix | Meaning |
| --- | --- |
| `+` | Fragment must be present in the file |
| `-` | Fragment must not be present |
| (none) | Fragment must be present (same as `+`) |

Scoring: passed_checks / total_checks
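The prefix rules can be sketched as follows (illustrative; the file content is invented):

```python
# Apply the documented prefix rules to each `contains` fragment.
def fragment_passes(fragment: str, content: str) -> bool:
    if fragment.startswith("-"):
        return fragment[1:] not in content   # "-" means must be absent
    if fragment.startswith("+"):
        fragment = fragment[1:]              # "+" means must be present
    return fragment in content               # no prefix: same as "+"

content = "def new_function():\n    return 42\n"
fragments = ["+def new_function():", "-def old_function():", "return 42"]
score = sum(fragment_passes(f, content) for f in fragments) / len(fragments)
print(score)  # 1.0
```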

```yaml
- type: diff
  name: api_updates
  config:
    expected_files:
      - path: "src/api.py"
        contains:
          - "+from fastapi import FastAPI"
          - "+@app.get('/health')"
          - "-import flask"
```

## JSON Schema (`json_schema`)

Validates that the agent output is valid JSON conforming to a given schema. Supports both inline schemas and schema files.

```yaml
- type: json_schema
  name: api_response
  config:
    schema:
      type: object
      required: [status, data]
      properties:
        status:
          type: string
          enum: [success, error]
        data:
          type: object
```
| Option | Type | Description |
| --- | --- | --- |
| `schema` | object | Inline JSON Schema definition |
| `schema_file` | string | Path to a `.json` schema file |

One of `schema` or `schema_file` is required.
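To see what the example schema above enforces, here is a hand-rolled stdlib sketch of the same checks (the grader itself presumably uses a full JSON Schema validator):

```python
# Mirrors the inline schema above: an object with required status/data,
# status constrained to an enum, data an object. Sketch only.
import json

def validate_sketch(raw: str) -> bool:
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (isinstance(doc, dict)
            and all(k in doc for k in ("status", "data"))
            and doc.get("status") in ("success", "error")
            and isinstance(doc.get("data"), dict))

print(validate_sketch('{"status": "success", "data": {"id": 1}}'))  # True
print(validate_sketch('{"status": "unknown", "data": {}}'))         # False
```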

```yaml
- type: json_schema
  name: validate_manifest
  config:
    schema_file: "schemas/manifest.schema.json"
```

## Prompt (`prompt`)

Uses a second LLM to evaluate the agent's work. The judge LLM calls the `set_waza_grade_pass` or `set_waza_grade_fail` tool to render its verdict. This is the most flexible grader: it can assess quality, correctness, style, or anything you can describe in natural language.

```yaml
- type: prompt
  name: quality_check
  config:
    prompt: |
      Review the agent's response. Check that the explanation is:
      1. Technically accurate
      2. Easy to understand
      3. Includes code examples
      If all criteria are met, call set_waza_grade_pass.
      Otherwise, call set_waza_grade_fail with your reasoning.
    model: "gpt-4o-mini"
```
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `prompt` | string | (required) | Instructions for the judge LLM |
| `model` | string | (required) | Model to use for judging |
| `continue_session` | bool | false | Resume the agent's session (judge sees full context) |
  1. waza starts a new Copilot session (or resumes the agent's session if `continue_session: true`).
  2. The judge receives your prompt plus two tool definitions: `set_waza_grade_pass` and `set_waza_grade_fail`.
  3. The judge calls one of the tools. If it calls `set_waza_grade_pass`, the score is 1.0; if `set_waza_grade_fail`, the score is 0.0.

### Example: file review with `continue_session`

```yaml
- type: prompt
  name: file_review
  config:
    prompt: |
      Check that the files on disk are properly updated.
      Verify the code compiles and follows best practices.
      If correct, call set_waza_grade_pass.
      If not, call set_waza_grade_fail with your reasoning.
    model: "claude-sonnet-4.5"
    continue_session: true
```

## Behavior (`behavior`)

Validates agent behavior metrics: how many tool calls were made, token consumption, required/forbidden tools, and execution duration. Use this to enforce efficiency and safety guardrails.

```yaml
- type: behavior
  name: efficiency_check
  config:
    max_tool_calls: 15
    max_tokens: 50000
    max_duration_ms: 60000
    required_tools:
      - bash
      - edit
    forbidden_tools:
      - rm
      - sudo
```
| Option | Type | Description |
| --- | --- | --- |
| `max_tool_calls` | int | Maximum allowed tool calls (0 = no limit) |
| `max_tokens` | int | Maximum total token usage (0 = no limit) |
| `max_duration_ms` | int | Maximum execution time in ms (0 = no limit) |
| `required_tools` | list[str] | Tool names that must be used |
| `forbidden_tools` | list[str] | Tool names that must not be used |

At least one option must be configured. Each configured rule counts as one check.

Scoring: passed_checks / total_checks
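Rule counting for the `efficiency_check` config above can be sketched like this (the metrics are invented):

```python
# Each configured rule is one check; the score is the fraction that pass.
metrics = {
    "tool_calls": 12,
    "tokens": 48_000,
    "duration_ms": 45_000,
    "tools_used": {"bash", "edit", "grep"},
}
checks = [
    metrics["tool_calls"] <= 15,                   # max_tool_calls
    metrics["tokens"] <= 50_000,                   # max_tokens
    metrics["duration_ms"] <= 60_000,              # max_duration_ms
    {"bash", "edit"} <= metrics["tools_used"],     # required_tools all used
    not ({"rm", "sudo"} & metrics["tools_used"]),  # no forbidden_tools used
]
score = sum(checks) / len(checks)
print(score)  # 1.0
```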


## Action Sequence (`action_sequence`)

Validates the sequence of tool calls the agent made against an expected action path. Supports three matching modes for different levels of strictness.

```yaml
- type: action_sequence
  name: deploy_workflow
  config:
    matching_mode: in_order_match
    expected_actions:
      - bash
      - edit
      - bash
      - git
```
| Option | Type | Description |
| --- | --- | --- |
| `expected_actions` | list[str] | The expected tool call sequence |
| `matching_mode` | string | How to compare actual vs. expected (see below) |

| Mode | Description |
| --- | --- |
| `exact_match` | Actual tool calls must exactly match the expected list (same tools, same order, same count) |
| `in_order_match` | Expected tools must appear in order, but extra tools between them are allowed |
| `any_order_match` | All expected tools must appear, but order doesn't matter |

Scoring: F1 score computed from precision (correct calls / total actual) and recall (matched / total expected).
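A sketch of `in_order_match` scoring (the greedy matching is an assumption about the implementation; the precision/recall definitions are from the line above):

```python
# Greedily match expected tools in order, then compute F1.
actual = ["bash", "edit", "grep", "bash", "git"]   # invented agent trace
expected = ["bash", "edit", "bash", "git"]

matched, i = 0, 0
for call in actual:
    if i < len(expected) and call == expected[i]:
        matched, i = matched + 1, i + 1

precision = matched / len(actual)    # correct calls / total actual
recall = matched / len(expected)     # matched / total expected
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.889
```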


## Skill Invocation (`skill_invocation`)

Validates which Copilot skills the agent invoked during a session. Useful for multi-skill orchestration testing: verifying the agent delegates to the right skills.

```yaml
- type: skill_invocation
  name: routing_check
  config:
    required_skills:
      - azure-prepare
      - azure-deploy
    mode: in_order
    allow_extra: true
```
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `required_skills` | list[str] | (required) | Skills that must be invoked |
| `mode` | string | (required) | Matching mode: `exact_match`, `in_order`, or `any_order` |
| `allow_extra` | bool | true | Whether extra skill invocations are allowed; when false, extras are penalized |

When `allow_extra: false`, extra invocations beyond the required list reduce the score by up to 60%.

Scoring: F1 score (harmonic mean of precision and recall), with an optional penalty for extra invocations.


## Tool Constraint (`tool_constraint`)

Validates which tools an agent should or shouldn't use, and enforces turn and token limits. Reads from the session digest to check tool usage, turn counts, and total token consumption.

```yaml
- type: tool_constraint
  name: guardrails
  config:
    expect_tools:
      - bash
      - edit
    reject_tools:
      - rm
      - sudo
    max_turns: 10
    max_tokens: 50000
```
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `expect_tools` | list[str] | [] | Tool names that must appear in the session |
| `reject_tools` | list[str] | [] | Tool names that must not appear |
| `max_turns` | int | 0 | Maximum conversation turns (0 = no limit) |
| `max_tokens` | int | 0 | Maximum total token usage (0 = no limit) |

At least one constraint must be configured. Each configured rule counts as one check.

Scoring: passed_checks / total_checks

```yaml
- type: tool_constraint
  name: safety_check
  config:
    reject_tools:
      - rm
      - sudo
      - kill
    max_turns: 15
    max_tokens: 100000
- type: tool_constraint
  name: workflow_tools
  config:
    expect_tools:
      - bash
      - edit
      - grep
```

## Program (`program`)

Runs any external command to grade the agent output. The agent output is passed via stdin, and the workspace directory is available as the `WAZA_WORKSPACE_DIR` environment variable. Exit code 0 means pass (score 1.0); non-zero means fail (score 0.0).

```yaml
- type: program
  name: lint_check
  config:
    command: "python3"
    args: ["scripts/grade.py"]
    timeout: 60
```
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `command` | string | (required) | Program to execute |
| `args` | list[str] | [] | Arguments passed to the program |
| `timeout` | int | 30 | Max execution time in seconds |
```yaml
- type: program
  name: build_test
  config:
    command: "bash"
    args: ["-c", "cd $WAZA_WORKSPACE_DIR && npm test"]
    timeout: 120
```
```python
#!/usr/bin/env python3
"""scripts/grade.py: reads agent output from stdin, exits 0 or 1."""
import os
import sys

output = sys.stdin.read()
workspace = os.environ.get("WAZA_WORKSPACE_DIR", "")

# Check that a required file was created
if os.path.exists(os.path.join(workspace, "result.json")):
    print("✓ result.json created")
    sys.exit(0)
else:
    print("✗ result.json missing")
    sys.exit(1)
```

## Global graders

Defined at the top level of `eval.yaml`, applied to every task:

```yaml
graders:
  - type: text
    name: no_errors
    config:
      regex_not_match:
        - "(?i)fatal error|crashed|exception occurred"
  - type: code
    name: has_output
    config:
      assertions:
        - "len(output) > 10"

tasks:
  - task_files: ["tasks/*.yaml"]
```

## Referencing graders by name

Define graders globally, then reference them by name in individual tasks:

```yaml
graders:
  - type: text
    name: format_check
    config:
      regex_match: ["^[A-Z]"]
  - type: code
    name: length_check
    config:
      assertions:
        - "len(output) > 100"

tasks:
  - id: task-001
    inputs:
      prompt: "Explain this code"
    expected:
      graders:
        - format_check
        - length_check
```

## Per-task graders

Define graders directly inside a task:

```yaml
tasks:
  - id: task-001
    inputs:
      prompt: "Create a REST API"
    expected:
      graders:
        - type: file
          name: api_files
          config:
            must_exist:
              - "src/api.py"
              - "requirements.txt"
        - type: diff
          name: api_content
          config:
            expected_files:
              - path: "src/api.py"
                contains:
                  - "+from fastapi import FastAPI"
```

## Weighted scoring

By default every grader counts equally toward the composite score. Add a `weight` field to shift importance:

```yaml
graders:
  - type: text
    name: critical_check
    weight: 3.0   # Counts 3×
    config:
      regex_match: ["deployed"]
  - type: text
    name: nice_to_have
    weight: 0.5   # Counts 0.5×
    config:
      contains: [summary]
  - type: code
    name: basic_length
    # weight omitted → defaults to 1.0
    config:
      assertions:
        - "len(output) > 50"
```
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `weight` | float | 1.0 | Relative importance of this grader in the composite score |

Formula: (score₁ × weight₁ + score₂ × weight₂ + …) / (weight₁ + weight₂ + …)

With the config above and scores of 1.0, 0.0, and 1.0, the composite score is (1.0×3 + 0.0×0.5 + 1.0×1) / (3+0.5+1) = 0.89.
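Reproducing that arithmetic:

```python
# Weighted composite: sum of score × weight, divided by the total weight.
scores = [1.0, 0.0, 1.0]
weights = [3.0, 0.5, 1.0]
composite = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
print(round(composite, 2))  # 0.89
```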


## Combining multiple graders

You can stack multiple graders on a single task. All graders run independently and each produces its own score. A task passes when all graders pass.

```yaml
graders:
  # 1. Output mentions key concepts
  - type: text
    name: concepts
    config:
      contains: [authentication, JWT, middleware]
  # 2. No error patterns
  - type: text
    name: no_errors
    config:
      regex_not_match: ["(?i)error|exception"]
  # 3. Required files exist
  - type: file
    name: deliverables
    config:
      must_exist: ["src/auth.ts", "src/middleware.ts"]
  # 4. Agent was efficient
  - type: behavior
    name: efficiency
    config:
      max_tool_calls: 20
      max_tokens: 40000
  # 5. LLM judge confirms quality
  - type: prompt
    name: quality
    config:
      prompt: |
        Review the implementation for security best practices.
        Call set_waza_grade_pass if secure, set_waza_grade_fail if not.
      model: gpt-4o-mini
```

## Testing skill activation

Verify a skill activates on the right prompts and stays silent on the wrong ones:

```yaml
tasks:
  - id: should-trigger
    inputs:
      prompt: "Deploy my app to Azure"
    expected:
      graders:
        - type: text
          name: azure_response
          config:
            contains: [azure, deploy, resource]
  - id: should-not-trigger
    inputs:
      prompt: "What's the weather today?"
    expected:
      graders:
        - type: text
          name: no_azure
          config:
            not_contains: [azure, deploy, bicep]
```

## Testing file edits

Test that the agent correctly modifies source files:

```yaml
graders:
  - type: file
    name: files_created
    config:
      must_exist:
        - "src/utils.ts"
        - "tests/utils.test.ts"
  - type: diff
    name: correct_edits
    config:
      expected_files:
        - path: "src/utils.ts"
          contains:
            - "+export function formatDate"
            - "+export function parseConfig"
        - path: "tests/utils.test.ts"
          contains:
            - "+describe('formatDate'"
  - type: program
    name: tests_pass
    config:
      command: bash
      args: ["-c", "cd $WAZA_WORKSPACE_DIR && npm test --silent"]
      timeout: 60
```

## Testing skill orchestration

Verify the agent invokes the right skills in the right order:

```yaml
graders:
  - type: skill_invocation
    name: correct_workflow
    config:
      required_skills:
        - brainstorming
        - azure-prepare
        - azure-deploy
      mode: in_order
      allow_extra: false
  - type: action_sequence
    name: tool_usage
    config:
      matching_mode: in_order_match
      expected_actions:
        - bash
        - create
        - edit
        - bash
```

## Best practices

  1. Start simple — Begin with keyword or regex graders, then add stricter graders as you identify failure modes.
  2. Layer your checks — Combine output graders (regex, keyword) with workspace graders (file, diff) and behavior graders for comprehensive coverage.
  3. Use descriptive names — `checks_auth_flow` beats `grader1`. Names appear in the dashboard and CLI output.
  4. Use prompt for subjective quality — When you can't express the check as a pattern or assertion, let an LLM judge it.
  5. Set behavior budgets — Use the behavior grader to catch runaway agents that burn too many tokens or tool calls.
  6. Test graders in isolation — Run a single task with `waza run eval.yaml --task my-task -v` to verify graders before running the full suite.
  7. Use program as an escape hatch — When you need full programmatic control, write a script in any language and use the program grader.