# Graders
Graders are the scoring engine behind every waza evaluation. After an agent executes a task, one or more graders inspect the result and produce a verdict.
## How graders work

- The agent runs a task inside an isolated workspace.
- waza collects the output text, transcript (tool calls, events), session digest (token counts, tools used), and the workspace directory (files on disk).
- Each grader receives that context and returns a result:
| Field | Type | Description |
|---|---|---|
| `score` | float | 0.0 – 1.0 (proportion of checks passed) |
| `passed` | bool | Whether the grader considers the task successful |
| `feedback` | string | Human-readable explanation |
| `details` | object | Structured metadata for debugging |
You can attach graders globally (applied to every task) or per-task in your eval YAML. Each grader also accepts an optional weight field that controls its influence on the composite score (see Weighted Scoring below).
## At a glance

waza ships with several built-in grader types. Pick the right one for the job:
| Type | YAML key | What it checks |
|---|---|---|
| Inline Script | code | Python/JS assertion expressions against output |
| Text | text | Text and regex matching against output |
| File | file | File existence and content patterns in workspace |
| Diff | diff | Workspace files vs. expected snapshots or fragments |
| JSON Schema | json_schema | Output validates against a JSON Schema |
| Prompt (LLM-as-judge) | prompt | A second LLM grades the result |
| Behavior | behavior | Agent metrics — tool calls, tokens, duration |
| Action Sequence | action_sequence | Tool call ordering and completeness |
| Skill Invocation | skill_invocation | Which skills were invoked and in what order |
| Program | program | External command (any language) grades via exit code |
| Tool Constraint | tool_constraint | Validate tool usage constraints (expect/reject lists and argument patterns) |
| Tool Calls | tool_calls | Validate required/forbidden tools and call count bounds |
| Trigger | trigger | Heuristic grader for validating whether a prompt should activate a skill |
## Inline Script (code)

Evaluates Python or JavaScript assertion expressions against the execution context. Each assertion is a one-line expression that must evaluate to `True`.

```yaml
- type: code
  name: output_quality
  config:
    language: python  # or "javascript" — default is python
    assertions:
      - "len(output) > 100"
      - "'function' in output.lower()"
      - "len(transcript) > 0"
```

### Context variables
| Variable | Type | Description |
|---|---|---|
| `output` | str | Agent's final text output |
| `outcome` | dict | Structured outcome state |
| `transcript` | list | Full execution transcript events |
| `tool_calls` | list | Tool calls extracted from transcript |
| `errors` | list | Errors from transcript |
| `duration_ms` | int | Execution wall-clock time |
Built-in functions: len, any, all, str, int, float, bool, list, dict, re
Scoring: passed_assertions / total_assertions
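Conceptually, assertion scoring works like the sketch below. This is an illustrative model, not waza's actual implementation; the function name `run_code_grader` and the exact sandboxing are assumptions.

```python
import re

def run_code_grader(assertions, context):
    # Expose only the documented built-ins plus the context variables,
    # then evaluate each one-line assertion expression.
    safe_builtins = {"len": len, "any": any, "all": all, "str": str,
                     "int": int, "float": float, "bool": bool,
                     "list": list, "dict": dict, "re": re}
    passed = 0
    for expr in assertions:
        try:
            if eval(expr, {"__builtins__": {}}, {**safe_builtins, **context}):
                passed += 1
        except Exception:
            pass  # an assertion that raises counts as failed
    # Score is the proportion of passing assertions.
    return passed / len(assertions) if assertions else 0.0
```

With `output` set to 150 characters of filler and a one-event transcript, the three assertions from the example above would score 2/3: the length and transcript checks pass, the substring check fails.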
### JavaScript example

```yaml
- type: code
  name: js_checks
  config:
    language: javascript
    assertions:
      - "output.length > 50"
      - "output.includes('hello')"
```

## Text (text)

Validates the agent output using substring matching and regex patterns. Supports case-insensitive and case-sensitive substring checks, plus regex pattern matching.
```yaml
- type: text
  name: format_checker
  config:
    contains:
      - "deployed to"
      - "Resource group"
    not_contains:
      - "permission denied"
    regex_match:
      - "https?://.+"
    regex_not_match:
      - "(?i)error|failed|exception"
```

| Option | Type | Description |
|---|---|---|
| `contains` | list[str] | Substrings that must appear (case-insensitive) |
| `not_contains` | list[str] | Substrings that must not appear (case-insensitive) |
| `contains_cs` | list[str] | Substrings that must appear (case-sensitive) |
| `not_contains_cs` | list[str] | Substrings that must not appear (case-sensitive) |
| `regex_match` | list[str] | Regex patterns that must match in output |
| `regex_not_match` | list[str] | Regex patterns that must not match |
Scoring: passed_checks / total_checks
### Example: code quality gate

```yaml
- type: text
  name: code_quality
  config:
    contains:
      - "def "
      - "return"
    not_contains:
      - "TODO"
      - "FIXME"
    regex_match:
      - "def \\w+\\(.*\\):"   # Has function definitions
    regex_not_match:
      - "print\\("            # No debug prints
```

## File (file)

Validates file existence and content patterns in the agent's workspace directory. Use when the agent creates, modifies, or should avoid certain files.
```yaml
- type: file
  name: project_structure
  config:
    must_exist:
      - "src/index.ts"
      - "package.json"
      - "tsconfig.json"
    must_not_exist:
      - "node_modules/"
      - ".env"
    content_patterns:
      - path: "package.json"
        must_match:
          - '"name":\\s*"my-app"'
        must_not_match:
          - '"version":\\s*"0\\.0\\.0"'
```

| Option | Type | Description |
|---|---|---|
| `must_exist` | list[str] | Workspace-relative paths that must be present |
| `must_not_exist` | list[str] | Paths that must not be present |
| `content_patterns` | list | Regex checks against file contents (see below) |

Each `content_patterns` entry:
| Field | Type | Description |
|---|---|---|
| `path` | string | Workspace-relative file path |
| `must_match` | list[str] | Regex patterns the file content must match |
| `must_not_match` | list[str] | Regex patterns the file must not match |
## Diff (diff)

Compares post-execution workspace files against expected snapshots or content fragments. Ideal for testing file-editing tasks where you know the expected output.
```yaml
- type: diff
  name: code_edits
  config:
    expected_files:
      - path: "src/main.py"
        contains:
          - "+def new_function():"
          - "+    return 42"
          - "-def old_function():"
      - path: "README.md"
        snapshot: "expected/README.md"
```

Each `expected_files` entry supports:
| Field | Type | Description |
|---|---|---|
| `path` | string | Workspace-relative file path (required) |
| `snapshot` | string | Path to expected file for exact matching |
| `contains` | list[str] | Content fragments to check (see prefix rules) |
### Contains prefix rules

| Prefix | Meaning |
|---|---|
| `+` | Fragment must be present in the file |
| `-` | Fragment must not be present |
| (none) | Fragment must be present (same as `+`) |
Scoring: passed_checks / total_checks
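The prefix rules amount to a simple presence/absence test per fragment. A minimal sketch of that logic (the function name `fragment_ok` is illustrative, and we assume exactly one prefix character is stripped, so indentation after `+` is preserved):

```python
def fragment_ok(file_text: str, fragment: str) -> bool:
    # "-" means the fragment must be absent from the file.
    if fragment.startswith("-"):
        return fragment[1:] not in file_text
    # "+" means present; no prefix is treated the same as "+".
    if fragment.startswith("+"):
        return fragment[1:] in file_text
    return fragment in file_text
```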
```yaml
- type: diff
  name: api_updates
  config:
    expected_files:
      - path: "src/api.py"
        contains:
          - "+from fastapi import FastAPI"
          - "+@app.get('/health')"
          - "-import flask"
```

```yaml
- type: diff
  name: exact_config
  config:
    expected_files:
      - path: "config.json"
        snapshot: "snapshots/expected_config.json"
```

## JSON Schema (json_schema)
Validates that the agent output is valid JSON conforming to a given schema. Supports both inline schemas and schema files.
```yaml
- type: json_schema
  name: api_response
  config:
    schema:
      type: object
      required: [status, data]
      properties:
        status:
          type: string
          enum: [success, error]
        data:
          type: object
```

| Option | Type | Description |
|---|---|---|
| `schema` | object | Inline JSON Schema definition |
| `schema_file` | string | Path to a .json schema file |
One of `schema` or `schema_file` is required.
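To see what this grader is checking, here is a toy sketch of the two failure modes it catches: output that is not valid JSON, and JSON that violates the schema. This stdlib-only version only checks the top-level `type` and `required` keywords; a real grader would apply the full JSON Schema spec (for example via the `jsonschema` library).

```python
import json

def output_matches(output: str, schema: dict) -> bool:
    # Failure mode 1: output is not parseable JSON at all.
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    # Failure mode 2: parsed JSON violates the schema.
    if schema.get("type") == "object" and not isinstance(data, dict):
        return False
    return all(key in data for key in schema.get("required", []))
```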
### Example: schema file

```yaml
- type: json_schema
  name: validate_manifest
  config:
    schema_file: "schemas/manifest.schema.json"
```

## Prompt (LLM-as-judge)
Uses a second LLM to evaluate the agent's work. The judge LLM calls the `set_waza_grade_pass` or `set_waza_grade_fail` tool to render its verdict. This is the most flexible grader — it can assess quality, correctness, style, or anything you can describe in natural language.
```yaml
- type: prompt
  name: quality_check
  config:
    prompt: |
      Review the agent's response. Check that the explanation is:
      1. Technically accurate
      2. Easy to understand
      3. Includes code examples

      If all criteria are met, call set_waza_grade_pass.
      Otherwise, call set_waza_grade_fail with your reasoning.
    model: "gpt-4o-mini"
```

| Option | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | (required) | Instructions for the judge LLM |
| `model` | string | (required) | Model to use for judging |
| `continue_session` | bool | false | Resume the agent's session (judge sees full context) |
| `mode` | string | independent | Judging mode: `independent` (standard) or `pairwise` (compare two, requires `--baseline`) |
### How it works

- waza starts a new Copilot session (or resumes the agent's session if `continue_session: true`).
- The judge receives your prompt plus two tool definitions: `set_waza_grade_pass` and `set_waza_grade_fail`.
- The judge calls one of the tools. If it calls `set_waza_grade_pass`, the score is `1.0`; if `set_waza_grade_fail`, the score is `0.0`.
### Example: file review with continue_session

```yaml
- type: prompt
  name: file_review
  config:
    prompt: |
      Check that the files on disk are properly updated.
      Verify the code compiles and follows best practices.
      If correct, call set_waza_grade_pass.
      If not, call set_waza_grade_fail with your reasoning.
    model: "claude-sonnet-4.5"
    continue_session: true
```

## Behavior
Validates agent behavior metrics — how many tool calls were made, token consumption, required/forbidden tools, and execution duration. Use this to enforce efficiency and safety guardrails.
```yaml
- type: behavior
  name: efficiency_check
  config:
    max_tool_calls: 15
    max_tokens: 50000
    max_duration_ms: 60000
    required_tools:
      - bash
      - edit
    forbidden_tools:
      - rm
      - sudo
```

| Option | Type | Description |
|---|---|---|
| `max_tool_calls` | int | Maximum allowed tool calls (0 = no limit) |
| `max_tokens` | int | Maximum total token usage (0 = no limit) |
| `max_duration_ms` | int | Maximum execution time in ms (0 = no limit) |
| `required_tools` | list[str] | Tool names that must be used |
| `forbidden_tools` | list[str] | Tool names that must not be used |
At least one option must be configured. Each configured rule counts as one check.
Scoring: passed_checks / total_checks
## Action Sequence (action_sequence)

Validates the sequence of tool calls the agent made against an expected action path. Supports three matching modes for different levels of strictness.
```yaml
- type: action_sequence
  name: deploy_workflow
  config:
    matching_mode: in_order_match
    expected_actions:
      - bash
      - edit
      - bash
      - git
```

| Option | Type | Description |
|---|---|---|
| `expected_actions` | list[str] | The expected tool call sequence |
| `matching_mode` | string | How to compare actual vs. expected (see below) |
### Matching modes

| Mode | Description |
|---|---|
| `exact_match` | Actual tool calls must exactly match the expected list (same tools, same order, same count) |
| `in_order_match` | Expected tools must appear in order, but extra tools between them are allowed |
| `any_order_match` | All expected tools must appear, but order doesn't matter |
Scoring: F1 score computed from precision (correct calls / total actual) and recall (matched / total expected).
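The F1 computation can be sketched as follows for `in_order_match`. This is an assumed model of the scoring, not waza's exact code: expected tools are matched greedily in order against the actual calls, then precision and recall are combined as the usual harmonic mean.

```python
def sequence_f1(expected, actual):
    # Greedily match expected tools in order against the actual call list.
    matched = 0
    for tool in actual:
        if matched < len(expected) and tool == expected[matched]:
            matched += 1
    # Precision: matched calls / total actual; recall: matched / total expected.
    precision = matched / len(actual) if actual else 0.0
    recall = matched / len(expected) if expected else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

An exact match scores 1.0; one extra `ls` call between `bash` and `edit` drops precision to 2/3 while recall stays at 1.0, giving F1 = 0.8.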
## Skill Invocation (skill_invocation)

Validates which Copilot skills the agent invoked during a session. Useful for multi-skill orchestration testing — verifying the agent delegates to the right skills.
```yaml
- type: skill_invocation
  name: routing_check
  config:
    required_skills:
      - azure-prepare
      - azure-deploy
    mode: in_order
    allow_extra: true
```

| Option | Type | Default | Description |
|---|---|---|---|
| `required_skills` | list[str] | (required) | Skills that must be invoked |
| `mode` | string | (required) | Matching mode: `exact_match`, `in_order`, or `any_order` |
| `allow_extra` | bool | true | Whether extra skill invocations are allowed without penalty |
When allow_extra: false, extra invocations beyond the required list reduce the score by up to 60%.
Scoring: F1 score (harmonic mean of precision and recall), with optional penalty for extra invocations.
## Tool Constraint (tool_constraint)

Validates which tools an agent should or shouldn't use, and enforces turn and token limits. Reads from the session digest to check tool usage, turn counts, and total token consumption.
```yaml
- type: tool_constraint
  name: guardrails
  config:
    expect_tools:
      - bash
      - edit
    reject_tools:
      - rm
      - sudo
    max_turns: 10
    max_tokens: 50000
```

| Option | Type | Default | Description |
|---|---|---|---|
| `expect_tools` | list[str] | [] | Tool names that must appear in the session |
| `reject_tools` | list[str] | [] | Tool names that must not appear |
| `max_turns` | int | 0 | Maximum conversation turns (0 = no limit) |
| `max_tokens` | int | 0 | Maximum total token usage (0 = no limit) |
At least one constraint must be configured. Each configured rule counts as one check.
Scoring: passed_checks / total_checks
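The check counting described above can be sketched like this. It is an illustrative model (the function name is hypothetical, and we assume each listed tool counts as its own check):

```python
def grade_tool_constraint(config, used_tools, turns, total_tokens):
    checks = []
    # Each expected tool is one check: it must appear in the session.
    for tool in config.get("expect_tools", []):
        checks.append(tool in used_tools)
    # Each rejected tool is one check: it must be absent.
    for tool in config.get("reject_tools", []):
        checks.append(tool not in used_tools)
    # Limits of 0 mean "no limit" and add no check.
    if config.get("max_turns", 0) > 0:
        checks.append(turns <= config["max_turns"])
    if config.get("max_tokens", 0) > 0:
        checks.append(total_tokens <= config["max_tokens"])
    score = sum(checks) / len(checks)
    return score, all(checks)
```

For example, a session that used `bash` and `edit`, avoided `rm`, but ran 12 turns against a `max_turns: 10` limit would pass 3 of 4 checks for a score of 0.75 and an overall fail.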
### Example: safety guardrails

```yaml
- type: tool_constraint
  name: safety_check
  config:
    reject_tools:
      - rm
      - sudo
      - kill
    max_turns: 15
    max_tokens: 100000
```

### Example: required workflow tools
```yaml
- type: tool_constraint
  name: workflow_tools
  config:
    expect_tools:
      - bash
      - edit
      - grep
```

## Tool Calls (tool_calls)
Validates which tools an agent called during execution. Supports required tools, forbidden tools, and call-count bounds — all in one grader. Each configured constraint counts as one check; partial credit is awarded.
```yaml
- type: tool_calls
  name: tool_usage
  config:
    required_tools:
      - bash
      - edit
    forbidden_tools:
      - rm
      - sudo
    min_calls: 2
    max_calls: 20
```

| Option | Type | Default | Description |
|---|---|---|---|
| `required_tools` | list[str] | [] | Tool names that must appear in the session |
| `forbidden_tools` | list[str] | [] | Tool names that must not appear |
| `min_calls` | int | 0 | Minimum total tool calls required (0 = no min) |
| `max_calls` | int | 0 | Maximum total tool calls allowed (0 = no limit) |
At least one constraint must be configured. When both `min_calls` and `max_calls` are set, `min_calls` must be ≤ `max_calls`.
Scoring: passed_checks / total_checks
### Example: enforce safe tool usage

```yaml
- type: tool_calls
  name: safety_check
  config:
    forbidden_tools:
      - rm
      - sudo
    max_calls: 30
```

### Example: verify minimum workflow
```yaml
- type: tool_calls
  name: workflow_check
  config:
    required_tools:
      - bash
      - edit
    min_calls: 3
```

## Program
Runs any external command to grade the agent output. The agent output is passed via stdin, and the workspace directory is available in the `WAZA_WORKSPACE_DIR` environment variable. Exit code 0 means pass (score 1.0); non-zero means fail (score 0.0).
```yaml
- type: program
  name: lint_check
  config:
    command: "python3"
    args: ["scripts/grade.py"]
    timeout: 60
```

| Option | Type | Default | Description |
|---|---|---|---|
| `command` | string | (required) | Program to execute |
| `args` | list[str] | [] | Arguments passed to the program |
| `timeout` | int | 30 | Max execution time in seconds |
### Example: shell script grader

```yaml
- type: program
  name: build_test
  config:
    command: "bash"
    args: ["-c", "cd $WAZA_WORKSPACE_DIR && npm test"]
    timeout: 120
```

### Example: custom Python grader
Section titled “Example: custom Python grader”#!/usr/bin/env python3"""scripts/grade.py — reads agent output from stdin, exits 0 or 1."""import sys, json, os
output = sys.stdin.read()workspace = os.environ.get("WAZA_WORKSPACE_DIR", "")
# Check that a required file was createdif os.path.exists(os.path.join(workspace, "result.json")): print("✓ result.json created") sys.exit(0)else: print("✗ result.json missing") sys.exit(1)Trigger
Heuristic grader for validating whether a prompt should activate a skill.
```yaml
- type: trigger
  name: deploy_trigger
  config:
    skill_path: skills/azure-deploy/SKILL.md
    mode: positive
    threshold: 0.6
```

| Option | Type | Description |
|---|---|---|
| `skill_path` | string | (required) Path to the SKILL.md file or the skill directory containing it |
| `mode` | string | Either `positive` or `negative` |
| `threshold` | number | Score threshold between 0.0 and 1.0. Default is 0.6 |
For more information, see the triggers grader documentation.
## Using graders in eval YAML

### Global graders

Defined at the top level of `eval.yaml`, applied to every task:
```yaml
graders:
  - type: text
    name: no_errors
    config:
      regex_not_match:
        - "(?i)fatal error|crashed|exception occurred"

  - type: code
    name: has_output
    config:
      assertions:
        - "len(output) > 10"

tasks:
  - task_files: ["tasks/*.yaml"]
```

### Per-task graders (by reference)
Section titled “Per-task graders (by reference)”Define graders globally, then reference them by name in individual tasks:
```yaml
graders:
  - type: text
    name: format_check
    config:
      regex_match: ["^[A-Z]"]

  - type: code
    name: length_check
    config:
      assertions:
        - "len(output) > 100"

tasks:
  - id: task-001
    inputs:
      prompt: "Explain this code"
    expected:
      graders:
        - format_check
        - length_check
```

### Per-task graders (inline)
Section titled “Per-task graders (inline)”Define graders directly inside a task:
```yaml
tasks:
  - id: task-001
    inputs:
      prompt: "Create a REST API"
    expected:
      graders:
        - type: file
          name: api_files
          config:
            must_exist:
              - "src/api.py"
              - "requirements.txt"
        - type: diff
          name: api_content
          config:
            expected_files:
              - path: "src/api.py"
                contains:
                  - "+from fastapi import FastAPI"
```

## Weighted Scoring
By default every grader counts equally toward the composite score. Add a `weight` field to shift importance:
```yaml
graders:
  - type: text
    name: critical_check
    weight: 3.0   # Counts 3×
    config:
      regex_match: ["deployed"]

  - type: text
    name: nice_to_have
    weight: 0.5   # Counts 0.5×
    config:
      contains: [summary]

  - type: code
    name: basic_length   # weight omitted → defaults to 1.0
    config:
      assertions:
        - "len(output) > 50"
```

| Option | Type | Default | Description |
|---|---|---|---|
| `weight` | float | 1.0 | Relative importance of this grader in the composite score |
Formula: (score₁ × weight₁ + score₂ × weight₂ + …) / (weight₁ + weight₂ + …)
With the config above and scores of 1.0, 0.0, and 1.0, the composite score is (1.0×3 + 0.0×0.5 + 1.0×1) / (3+0.5+1) = 0.89.
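In code, the weighted-average formula is a one-liner; the helper below is an illustrative sketch (the function name is hypothetical), reproducing the worked example above.

```python
def composite_score(results):
    # results: list of (score, weight) pairs, one per grader.
    total_weight = sum(weight for _, weight in results)
    return sum(score * weight for score, weight in results) / total_weight

# The worked example: scores 1.0, 0.0, 1.0 with weights 3.0, 0.5, 1.0.
composite = composite_score([(1.0, 3.0), (0.0, 0.5), (1.0, 1.0)])
```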
## Combining graders

You can stack multiple graders on a single task. All graders run independently and each produces its own score. A task passes when all graders pass.
```yaml
graders:
  # 1. Output mentions key concepts
  - type: text
    name: concepts
    config:
      contains: [authentication, JWT, middleware]

  # 2. No error patterns
  - type: text
    name: no_errors
    config:
      regex_not_match: ["(?i)error|exception"]

  # 3. Required files exist
  - type: file
    name: deliverables
    config:
      must_exist: ["src/auth.ts", "src/middleware.ts"]

  # 4. Agent was efficient
  - type: behavior
    name: efficiency
    config:
      max_tool_calls: 20
      max_tokens: 40000

  # 5. LLM judge confirms quality
  - type: prompt
    name: quality
    config:
      prompt: |
        Review the implementation for security best practices.
        Call set_waza_grade_pass if secure, set_waza_grade_fail if not.
      model: gpt-4o-mini
```

## Real-world examples
### Skill trigger accuracy

Verify a skill activates on the right prompts and stays silent on the wrong ones:
```yaml
tasks:
  - id: should-trigger
    inputs:
      prompt: "Deploy my app to Azure"
    expected:
      graders:
        - type: text
          name: azure_response
          config:
            contains: [azure, deploy, resource]

  - id: should-not-trigger
    inputs:
      prompt: "What's the weather today?"
    expected:
      graders:
        - type: text
          name: no_azure
          config:
            not_contains: [azure, deploy, bicep]
```

### Code editing task
Test that the agent correctly modifies source files:
```yaml
graders:
  - type: file
    name: files_created
    config:
      must_exist:
        - "src/utils.ts"
        - "tests/utils.test.ts"

  - type: diff
    name: correct_edits
    config:
      expected_files:
        - path: "src/utils.ts"
          contains:
            - "+export function formatDate"
            - "+export function parseConfig"
        - path: "tests/utils.test.ts"
          contains:
            - "+describe('formatDate'"

  - type: program
    name: tests_pass
    config:
      command: bash
      args: ["-c", "cd $WAZA_WORKSPACE_DIR && npm test --silent"]
      timeout: 60
```

### Multi-skill orchestration
Verify the agent invokes the right skills in the right order:
```yaml
graders:
  - type: skill_invocation
    name: correct_workflow
    config:
      required_skills:
        - brainstorming
        - azure-prepare
        - azure-deploy
      mode: in_order
      allow_extra: false

  - type: action_sequence
    name: tool_usage
    config:
      matching_mode: in_order_match
      expected_actions:
        - bash
        - create
        - edit
        - bash
```

## Best practices
- **Start simple** — Begin with `text` graders, then add stricter graders as you identify failure modes.
- **Layer your checks** — Combine output graders (`text`, `code`) with workspace graders (`file`, `diff`) and behavior graders for comprehensive coverage.
- **Use descriptive names** — `checks_auth_flow` beats `grader1`. Names appear in the dashboard and CLI output.
- **Use `prompt` for subjective quality** — When you can't express the check as a pattern or assertion, let an LLM judge it.
- **Set behavior budgets** — Use the `behavior` grader to catch runaway agents that burn too many tokens or tool calls.
- **Test graders in isolation** — Run a single task with `waza run eval.yaml --task my-task -v` to verify graders before running the full suite.
- **Use `program` as an escape hatch** — When you need full programmatic control, write a script in any language and use the `program` grader.
## Next steps

- Writing Eval Specs — Task and fixture configuration
- Web Dashboard — Visualize grader results
- CLI Reference — All commands and flags