Waza is a Go CLI tool for evaluating AI agent skills through structured benchmarks. Define test cases in YAML, run them against AI models, and validate results with pluggable validators.
Perfect for skill authors, platform teams, and developers building AI-powered applications.
Define Skills
Create comprehensive skill definitions with YAML-based evaluation specs and task fixtures.
Run Benchmarks
Execute evaluations against different AI models with fixture isolation and pluggable validators.
Compare Results
Cross-model comparison to measure skill effectiveness and track improvements.
Validate Quality
Use 11 built-in grader types to validate task completion, behavior, and quality metrics.
View Metrics
Interactive web dashboard with live results, trends, detailed analysis, and exports.
Integrate with CI/CD
Automated evaluation runs on pull requests with GitHub Actions integration.
Install waza in seconds:
    curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash

Or via azd extension:

    azd ext source add -n waza -t url -l https://raw.githubusercontent.com/microsoft/waza/main/registry.json
    azd ext install microsoft.azd.waza

Then initialize your first eval suite:

    waza init my-eval-suite
    cd my-eval-suite
    waza new my-skill
    waza run my-skill -v
    waza serve  # View results in the dashboard

Define evaluation specs, tasks, and validation rules in simple YAML:
    name: code-explainer-eval
    description: Test agent's ability to explain code

    config:
      model: claude-sonnet-4.6
      timeout_seconds: 300

    graders:
      - type: text
        name: checks_logic
        weight: 2.0
        config:
          pattern: "(?i)(function|logic|parameter)"
      - type: code
        name: has_output
        config:
          assertions:
            - "len(output) > 100"

    tasks:
      - "tasks/*.yaml"

11 grader types are built in.
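To make the two graders in the spec more concrete, here is a minimal Python sketch of what a regex-based `text` check and an assertion-based `code` check could look like. It is an illustration, not Waza's implementation: the function names `text_grader`, `code_grader`, and `weighted_score` are hypothetical, the assertions are assumed to be Python-style expressions, and the weighted aggregation (passing weight divided by total weight, with a default weight of 1.0) is an assumption.

```python
import re


def text_grader(output: str, pattern: str) -> bool:
    """Pass when the regex pattern matches anywhere in the agent output."""
    return re.search(pattern, output) is not None


def code_grader(output: str, assertions: list[str]) -> bool:
    """Pass when every assertion evaluates truthy with `output` bound."""
    # Assumption: assertions are Python-style expressions. eval() is used
    # here only for illustration and is unsafe on untrusted input.
    scope = {"__builtins__": {}, "len": len, "output": output}
    return all(bool(eval(expr, scope)) for expr in assertions)


def weighted_score(results: list[tuple[bool, float]]) -> float:
    """Assumed aggregation: weight of passing graders over total weight."""
    total = sum(weight for _, weight in results)
    passed = sum(weight for ok, weight in results if ok)
    return passed / total if total else 0.0


# A made-up agent response, repeated so it is longer than 100 characters.
output = "This function validates each parameter before use. " * 3

results = [
    # checks_logic: weight 2.0, as in the spec
    (text_grader(output, r"(?i)(function|logic|parameter)"), 2.0),
    # has_output: no weight in the spec, so 1.0 is assumed here
    (code_grader(output, ["len(output) > 100"]), 1.0),
]
print(weighted_score(results))
```

Under these assumptions, a grader with `weight: 2.0` counts twice as much toward the final score as one with the default weight.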
Each task gets a fresh temporary workspace with the fixtures copied in, so the original fixtures are never modified.
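The isolation idea can be sketched in a few lines of Python. This is only an illustration of the copy-into-a-temp-directory pattern, not how Waza implements it; `isolated_workspace` and the `main.py` fixture are hypothetical names.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def isolated_workspace(fixtures_dir: Path):
    """Yield a fresh temp directory that holds a copy of the fixtures."""
    with tempfile.TemporaryDirectory(prefix="task-") as tmp:
        workspace = Path(tmp) / "workspace"
        # Copy the whole fixture tree; the task only ever sees the copy.
        shutil.copytree(fixtures_dir, workspace)
        yield workspace
    # The temp directory, including the copy, is deleted on exit.


# Demo with a throwaway fixture directory.
with tempfile.TemporaryDirectory() as src:
    fixtures = Path(src)
    (fixtures / "main.py").write_text("print('hello')\n")

    with isolated_workspace(fixtures) as ws:
        # Simulate a task that edits a file inside its workspace.
        (ws / "main.py").write_text("# edited by the agent\n")

    # The edit happened in the copy, so the fixture is unchanged.
    original = (fixtures / "main.py").read_text()
    print(original)
```

Because every task starts from a new copy, one task's edits cannot leak into the next run or into the fixture source.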
Test skills against multiple AI models:
    waza run eval.yaml --model gpt-4o -o gpt4.json
    waza run eval.yaml --model claude-sonnet-4.6 -o sonnet.json
    waza compare gpt4.json sonnet.json

Serve interactive results:

    waza serve
    # opens http://localhost:3000 to view runs, comparisons, trends, and live updates

Integrated GitHub Actions support for automated evaluation runs.
Questions? Open an issue or check the GitHub repository.