
About

Waza (技 - Japanese for “skill/technique”) is a unified CLI platform for creating, testing, and evaluating AI agent skills.

It consolidates existing skill development tools into a single binary that covers the full workflow of creating, testing, and improving AI agent skills across any domain or platform.

Creating, testing, and evaluating AI agent skills lacks consistent tooling:

  • Automated compliance validation — No standardized scoring to ensure skill quality
  • Trigger testing — Manual verification of skill activation patterns
  • Cross-model evaluation — No framework for testing skills across GPT-4o, Claude, etc.
  • Token budget enforcement — Guidelines exist but aren’t automatically checked

Waza automates the skill development workflow:

| Phase | Capability |
| --- | --- |
| Scaffold | Generate a compliant skill structure ready for evaluation |
| Develop | Iterate with real-time compliance scoring |
| Test | Run agentic test loops with real LLM execution |
| Evaluate | Cross-model comparison with comprehensive metrics |

Waza is built in Go for:

  • Single-binary distribution (no dependencies)
  • Fast execution
  • Cross-platform compatibility (Linux, macOS, Windows)

```
waza/
├── cmd/waza/          # CLI entrypoint
├── internal/
│   ├── config/        # Configuration
│   ├── execution/     # Agent engines
│   ├── models/        # Data structures
│   ├── orchestration/ # Test runner
│   └── scoring/       # Validators
├── web/               # Dashboard (React + Tailwind)
└── examples/          # Example evals
```
Key design principles:

  1. Fixture Isolation — Each task gets a fresh temp workspace. Original fixtures are never modified.
  2. Pluggable Validators — 11 grader types, easily extended
  3. Cross-Model Support — Test skills against multiple LLM providers
  4. Local-First — Mock executor for development, real API for CI/CD
  5. Observability — Full transcripts, detailed metrics, dashboard visualization

Define test cases in YAML:

```yaml
name: code-explainer-eval
tasks:
  - "tasks/*.yaml"
graders:
  - type: text
    config:
      regex_match: ["function", "parameter"]
```
Available grader types:

  • Code — Python assertions
  • Text — Text matching
  • File — Output file checking
  • Diff — Diff comparison
  • JSON Schema — JSON structure validation
  • Prompt — LLM-powered evaluation
  • Behavior — Agent behavior validation
  • Action Sequence — Tool call sequence validation
  • Skill Invocation — Skill invocation validation
  • Program — Program execution validation
Run the same eval against multiple models, then compare the results:

```sh
waza run eval.yaml --model gpt-4o -o gpt4.json
waza run eval.yaml --model claude-sonnet-4.6 -o sonnet.json
waza compare gpt4.json sonnet.json
```
```sh
waza serve
```

Explore results, trends, and comparisons in a web interface.

Pre-configured GitHub Actions workflow:

```sh
waza init my-project
# Creates .github/workflows/eval.yml
```

We welcome contributions! Here’s how to get involved:

Found a bug or have a feature request?

  1. Check existing issues
  2. Open a new issue with clear reproduction steps
  3. Include error logs and environment details
To contribute code:

  1. Fork the repository
  2. Create a branch for your feature: git checkout -b feature/my-feature
  3. Make changes following the code style
  4. Write tests for new functionality
  5. Run linter and tests:
    ```sh
    make lint
    make test
    ```
  6. Commit with clear messages: feat: Add my feature
  7. Push to your fork and open a pull request
```sh
# Clone repository
git clone https://github.com/microsoft/waza.git
cd waza

# Set up Go environment
go version  # Requires 1.26+

# Build
make build

# Test
make test

# Lint
make lint

# Run
./waza --help
```
  • Follow Go idioms (Effective Go)
  • Use gofmt for formatting
  • Include unit tests for new code
  • Document public functions

To add a new grader type:

  1. Implement Validator interface in internal/scoring/
  2. Register in ValidatorRegistry
  3. Add tests
  4. Document in README

Example:

```go
type MyValidator struct {
	Config interface{} `json:"config"`
}

func (v *MyValidator) Grade(ctx *models.GradeContext) (*models.ValidationResult, error) {
	// Implementation
}
```
  • Update relevant docs when changing behavior
  • Add examples for new features
  • Keep README.md and GUIDE.md in sync

Have questions or ideas? Join the conversation in GitHub issues and discussions.

  • ✅ waza run — Execute benchmarks
  • ✅ waza init — Scaffold projects
  • ✅ waza new — Create skills
  • ✅ waza compare — Cross-model comparison
  • ✅ All 11 grader types
  • ✅ waza check — Compliance scoring
  • ✅ waza dev — Iterative improvement
  • 🟡 Token budget optimization
  • 🟡 Trigger accuracy testing

E3: Evaluation Framework (🟡 In Progress)

  • ✅ Multi-model testing
  • ✅ Comprehensive metrics
  • 🟡 Statistical analysis
  • 🟡 LLM-powered suggestions
  • Token counting across models
  • Budget enforcement
  • Optimization recommendations
  • Conversational skill interface
  • Interactive development
  • ✅ GitHub Actions workflow
  • ✅ Artifact handling
  • ✅ PR comments
  • ✅ Azure Developer CLI integration
  • ✅ Registry publishing

MIT License

Shayne Boyer (@spboyer)

Maintained by the waza team and community contributors.


Questions? Open an issue or start a discussion.


The waterfall timeline visualization in the waza dashboard was inspired by the .NET Aspire distributed application dashboard, which provides a similar trace/span view for distributed systems observability.