waza

The way of technique — measure, refine, master 🥋

Waza is a Go CLI tool for evaluating AI agent skills through structured benchmarks. Define test cases in YAML, run them against AI models, and validate results with pluggable validators.

Perfect for skill authors, platform teams, and developers building AI-powered applications.

Define Skills

Create comprehensive skill definitions with YAML-based evaluation specs and task fixtures.

Run Benchmarks

Execute evaluations against different AI models with fixture isolation and pluggable validators.

Compare Results

Compare results across models to measure skill effectiveness and track improvements.

Validate Quality

Use 11 built-in grader types to validate task completion, behavior, and quality metrics.

View Metrics

Interactive web dashboard with live results, trends, detailed analysis, and exports.

Integrate with CI/CD

Automated evaluation runs on pull requests with GitHub Actions integration.

Install waza in seconds:

curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash

Or via azd extension:

azd ext source add -n waza -t url -l https://raw.githubusercontent.com/microsoft/waza/main/registry.json
azd ext install microsoft.azd.waza

Then initialize your first eval suite:

waza init my-eval-suite
cd my-eval-suite
waza new my-skill
waza run my-skill -v
waza serve # View results in the dashboard

Define evaluation specs, tasks, and validation rules in simple YAML:

name: code-explainer-eval
description: Test agent's ability to explain code
config:
  model: claude-sonnet-4.6
  timeout_seconds: 300
graders:
  - type: text
    name: checks_logic
    weight: 2.0
    config:
      pattern: "(?i)(function|logic|parameter)"
  - type: code
    name: has_output
    config:
      assertions:
        - "len(output) > 100"
tasks:
  - "tasks/*.yaml"
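The `tasks` glob above points at per-task YAML files. One such file might look like the following sketch; the field names (`id`, `prompt`, `fixtures`) are illustrative assumptions, not waza's documented task schema:

```yaml
# tasks/explain-fibonacci.yaml — hypothetical task file.
# Field names here are assumptions for illustration only.
id: explain-fibonacci
prompt: Explain what the function in fib.py does and how it works.
fixtures:
  - fib.py
```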

11 grader types built-in:

  • Code — Python assertions
  • Regex — Pattern matching
  • Keyword — Keyword presence
  • File — Output file checking
  • Diff — Diff comparison
  • JSON Schema — JSON structure validation
  • Prompt — LLM-powered evaluation
  • Behavior — Agent behavior validation
  • Action Sequence — Tool call sequence validation
  • Skill Invocation — Skill invocation validation
  • Program — Program execution validation

Each task runs in a fresh temporary workspace with its fixtures copied in, so the original fixtures are never modified.
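The isolation pattern described above can be sketched in plain shell (this is an illustration of the idea, not waza's implementation): copy the fixtures into a throwaway directory, let the task mutate only that copy, and the originals stay untouched.

```shell
# Illustrative sketch of fixture isolation, not waza's code.
fixtures=$(mktemp -d)
echo "original" > "$fixtures/data.txt"

workspace=$(mktemp -d)            # fresh temp workspace per task
cp -R "$fixtures/." "$workspace"  # fixtures copied in

echo "modified" > "$workspace/data.txt"  # the task mutates its copy only
cat "$fixtures/data.txt"                 # the original fixture is unchanged
```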

Test skills against multiple AI models:

waza run eval.yaml --model gpt-4o -o gpt4.json
waza run eval.yaml --model claude-sonnet-4.6 -o sonnet.json
waza compare gpt4.json sonnet.json

Serve interactive results:

waza serve
# opens http://localhost:3000 — view runs, comparisons, trends, and live updates

Integrated GitHub Actions support for automated evaluation runs.
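A minimal workflow sketch, reusing the install script and run command shown above. This is an illustrative example, not an official waza workflow; the eval file name, model, and artifact names are placeholders:

```yaml
# .github/workflows/eval.yml — illustrative sketch only.
name: skill-evals
on: pull_request

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install waza
        run: curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
      - name: Run evaluations
        run: waza run eval.yaml --model gpt-4o -o results.json
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results.json
```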


Questions? Open an issue or check the GitHub repository.