waza

The way of technique — measure, refine, master 🥋

Waza is a Go CLI tool for evaluating AI agent skills through structured benchmarks. Define test cases in YAML, run them against AI models, and validate results with pluggable validators.

Perfect for skill authors, platform teams, and developers building AI-powered applications.

Define Skills

Create comprehensive skill definitions with YAML-based evaluation specs and task fixtures.

Run Benchmarks

Execute evaluations against different AI models with fixture isolation and pluggable validators.

Compare Results

Compare results across models to measure skill effectiveness and track improvements.

Validate Quality

Use 11 built-in grader types to validate task completion, behavior, and quality metrics.

View Metrics

Interactive web dashboard with live results, trends, detailed analysis, and exports.

Integrate with CI/CD

Automated evaluation runs on pull requests with GitHub Actions integration.

Install waza in seconds:

curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash

Or via azd extension:

azd ext source add -n waza -t url -l https://raw.githubusercontent.com/microsoft/waza/main/registry.json
azd ext install microsoft.azd.waza

Then initialize your first eval suite:

waza init my-eval-suite
cd my-eval-suite
waza new my-skill
waza run my-skill -v
waza serve # View results in the dashboard

Define evaluation specs, tasks, and validation rules in simple YAML:

name: code-explainer-eval
description: Test agent's ability to explain code
config:
  model: claude-sonnet-4.6
  timeout_seconds: 300
graders:
  - type: text
    name: checks_logic
    weight: 2.0
    config:
      pattern: "(?i)(function|logic|parameter)"
  - type: code
    name: has_output
    config:
      assertions:
        - "len(output) > 100"
tasks:
  - "tasks/*.yaml"
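The `tasks` glob above points at per-task YAML files. One such file might look like the following sketch; the field names (`id`, `prompt`, `fixtures`) are illustrative assumptions, not waza's documented task schema:

```yaml
# tasks/explain-fibonacci.yaml — hypothetical task file.
# Field names here are assumptions for illustration only.
id: explain-fibonacci
prompt: Explain what the function in fib.py does and how it works.
fixtures:
  - fib.py
```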

11 grader types built-in:

  • Code — Python assertions
  • Regex — Pattern matching
  • Keyword — Keyword presence
  • File — Output file checking
  • Diff — Diff comparison
  • JSON Schema — JSON structure validation
  • Prompt — LLM-powered evaluation
  • Behavior — Agent behavior validation
  • Action Sequence — Tool call sequence validation
  • Skill Invocation — Skill invocation validation
  • Program — Program execution validation

Each task runs in a fresh temporary workspace with its fixtures copied in, so the original fixtures are never modified.
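The isolation pattern described above can be sketched in plain shell (this is an illustration of the idea, not waza's implementation): copy the fixtures into a throwaway directory, let the task mutate only that copy, and the originals stay untouched.

```shell
# Illustrative sketch of fixture isolation, not waza's code.
fixtures=$(mktemp -d)
echo "original" > "$fixtures/data.txt"

workspace=$(mktemp -d)            # fresh temp workspace per task
cp -R "$fixtures/." "$workspace"  # fixtures copied in

echo "modified" > "$workspace/data.txt"  # the task mutates its copy only
cat "$fixtures/data.txt"                 # the original fixture is unchanged
```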

Test skills against multiple AI models:

waza run eval.yaml --model gpt-4o -o gpt4.json
waza run eval.yaml --model claude-sonnet-4.6 -o sonnet.json
waza compare gpt4.json sonnet.json

Serve interactive results:

waza serve
# opens http://localhost:3000 — view runs, comparisons, trends, and live updates

Integrated GitHub Actions support for automated evaluation runs.
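A minimal workflow sketch, reusing the install script and run command shown above. This is an illustrative example, not an official waza workflow; the eval file name, model, and artifact names are placeholders:

```yaml
# .github/workflows/eval.yml — illustrative sketch only.
name: skill-evals
on: pull_request

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install waza
        run: curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
      - name: Run evaluations
        run: waza run eval.yaml --model gpt-4o -o results.json
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results.json
```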


Questions? Open an issue or check the GitHub repository.