
About

Waza (技 - Japanese for “skill/technique”) is a unified CLI platform for creating, testing, and evaluating AI agent skills.

It consolidates existing skill development tools into a single binary that covers the full workflow of creating, testing, and improving AI agent skills across any domain or platform.

Creating, testing, and evaluating AI agent skills lacks consistent tooling:

  • Automated compliance validation — No standardized scoring to ensure skill quality
  • Trigger testing — Manual verification of skill activation patterns
  • Cross-model evaluation — No framework for testing skills across GPT-4o, Claude, etc.
  • Token budget enforcement — Guidelines exist but aren’t automatically checked

Waza automates the skill development workflow:

| Phase | Capability |
| --- | --- |
| Scaffold | Generate a compliant skill structure ready for evaluation |
| Develop | Iterate with real-time compliance scoring |
| Test | Run agentic test loops with real LLM execution |
| Evaluate | Cross-model comparison with comprehensive metrics |

Waza is built in Go for:

  • Single-binary distribution (no dependencies)
  • Fast execution
  • Cross-platform compatibility (Linux, macOS, Windows)

```
waza/
├── cmd/waza/          # CLI entrypoint
├── internal/
│   ├── config/        # Configuration
│   ├── execution/     # Agent engines
│   ├── models/        # Data structures
│   ├── orchestration/ # Test runner
│   └── scoring/       # Validators
├── web/               # Dashboard (React + Tailwind)
└── examples/          # Example evals
```
Key design principles:

  1. Fixture Isolation — Each task gets a fresh temp workspace. Original fixtures are never modified.
  2. Pluggable Validators — 11 grader types, easily extended
  3. Cross-Model Support — Test skills against multiple LLM providers
  4. Local-First — Mock executor for development, real API for CI/CD
  5. Observability — Full transcripts, detailed metrics, dashboard visualization

Define test cases in YAML:

```yaml
name: code-explainer-eval
tasks:
  - "tasks/*.yaml"
graders:
  - type: text
    config:
      regex_match: ["function", "parameter"]
```
Available grader types:

  • Code — Python assertions
  • Text — Text matching
  • File — Output file checking
  • Diff — Diff comparison
  • JSON Schema — JSON structure validation
  • Prompt — LLM-powered evaluation
  • Behavior — Agent behavior validation
  • Action Sequence — Tool call sequence validation
  • Skill Invocation — Skill invocation validation
  • Program — Program execution validation
Run the same eval against multiple models, then compare the results:

```sh
waza run eval.yaml --model gpt-4o -o gpt4.json
waza run eval.yaml --model claude-sonnet-4.6 -o sonnet.json
waza compare gpt4.json sonnet.json
```
```sh
waza serve
```

Explore results, trends, and comparisons in a web interface.

Pre-configured GitHub Actions workflow:

```sh
waza init my-project
# Creates .github/workflows/eval.yml
```

We welcome contributions! Here’s how to get involved:

Found a bug or have a feature request?

  1. Check existing issues
  2. Open a new issue with clear reproduction steps
  3. Include error logs and environment details
To contribute code:

  1. Fork the repository
  2. Create a branch for your feature: git checkout -b feature/my-feature
  3. Make changes following the code style
  4. Write tests for new functionality
  5. Run linter and tests:
    ```sh
    make lint
    make test
    ```
  6. Commit with clear messages: feat: Add my feature
  7. Push to your fork and open a pull request
```sh
# Clone repository
git clone https://github.com/microsoft/waza.git
cd waza

# Set up Go environment
go version  # Requires 1.26+

# Build
make build

# Test
make test

# Lint
make lint

# Run
./waza --help
```
  • Follow Go idioms (Effective Go)
  • Use gofmt for formatting
  • Include unit tests for new code
  • Document public functions

To add a new grader type:

  1. Implement Validator interface in internal/scoring/
  2. Register in ValidatorRegistry
  3. Add tests
  4. Document in README

Example:

```go
type MyValidator struct {
	Config interface{} `json:"config"`
}

func (v *MyValidator) Grade(ctx *models.GradeContext) (*models.ValidationResult, error) {
	// Implementation
}
```
  • Update relevant docs when changing behavior
  • Add examples for new features
  • Keep README.md and GUIDE.md in sync

Have questions or ideas? Join the conversation in GitHub issues and discussions.

  • ✅ waza run — Execute benchmarks
  • ✅ waza init — Scaffold projects
  • ✅ waza new — Create skills
  • ✅ waza compare — Cross-model comparison
  • ✅ All 11 grader types
  • ✅ waza check — Compliance scoring
  • ✅ waza dev — Iterative improvement
  • 🟡 Token budget optimization
  • 🟡 Trigger accuracy testing

E3: Evaluation Framework (🟡 In Progress)

  • ✅ Multi-model testing
  • ✅ Comprehensive metrics
  • 🟡 Statistical analysis
  • 🟡 LLM-powered suggestions
  • Token counting across models
  • Budget enforcement
  • Optimization recommendations
  • Conversational skill interface
  • Interactive development
  • ✅ GitHub Actions workflow
  • ✅ Artifact handling
  • ✅ PR comments
  • ✅ Azure Developer CLI integration
  • ✅ Registry publishing

MIT License

Shayne Boyer (@spboyer)

Maintained by the waza team and community contributors.


Questions? Open an issue or start a discussion.


The waterfall timeline visualization in the waza dashboard was inspired by the .NET Aspire distributed application dashboard, which provides a similar trace/span view for distributed systems observability.