About
What is Waza?
Waza (技, Japanese for “skill” or “technique”) is a unified CLI platform for creating, testing, and evaluating AI agent skills.
It consolidates existing skill development tools into a single binary that covers the full workflow, from scaffolding a skill to evaluating it, for any domain or platform.
The Problem
There is no consistent tooling for creating, testing, and evaluating AI agent skills:
- Automated compliance validation — No standardized scoring to ensure skill quality
- Trigger testing — Manual verification of skill activation patterns
- Cross-model evaluation — No framework for testing skills across GPT-4o, Claude, etc.
- Token budget enforcement — Guidelines exist but aren’t automatically checked
The Solution
Waza automates the skill development workflow:
| Phase | Capability |
|---|---|
| Scaffold | Generate a compliant skill structure ready for evaluation |
| Develop | Iterate with real-time compliance scoring |
| Test | Run agentic test loops with real LLM execution |
| Evaluate | Cross-model comparison with comprehensive metrics |
Architecture
Waza is built in Go for:
- Single-binary distribution (no dependencies)
- Fast execution
- Cross-platform compatibility (Linux, macOS, Windows)
Components

    waza/
    ├── cmd/waza/           # CLI entrypoint
    ├── internal/
    │   ├── config/         # Configuration
    │   ├── execution/      # Agent engines
    │   ├── models/         # Data structures
    │   ├── orchestration/  # Test runner
    │   └── scoring/        # Validators
    ├── web/                # Dashboard (React + Tailwind)
    └── examples/           # Example evals

Design Principles
- Fixture Isolation — Each task gets a fresh temp workspace. Original fixtures are never modified.
- Pluggable Validators — 11 grader types, easily extended
- Cross-Model Support — Test skills against multiple LLM providers
- Local-First — Mock executor for development, real API for CI/CD
- Observability — Full transcripts, detailed metrics, dashboard visualization
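
The fixture isolation principle above can be sketched as follows. This is a minimal illustration, not Waza's actual implementation: it assumes fixtures are plain files in a directory, and newWorkspace and demo are hypothetical names introduced only for this example.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// newWorkspace copies fixtureDir into a fresh temporary directory and
// returns the path of the copy. The fixture directory is only read.
// (Hypothetical helper for illustration; not taken from the Waza codebase.)
func newWorkspace(fixtureDir string) (string, error) {
	workDir, err := os.MkdirTemp("", "waza-task-*")
	if err != nil {
		return "", err
	}
	err = filepath.WalkDir(fixtureDir, func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		rel, err := filepath.Rel(fixtureDir, path)
		if err != nil {
			return err
		}
		target := filepath.Join(workDir, rel)
		if d.IsDir() {
			return os.MkdirAll(target, 0o755)
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		return os.WriteFile(target, data, 0o644)
	})
	if err != nil {
		os.RemoveAll(workDir)
		return "", err
	}
	return workDir, nil
}

// demo builds a throwaway fixture, edits the workspace copy, and returns
// the file contents seen in the fixture and in the workspace afterwards.
func demo() (string, string, error) {
	fixtures, err := os.MkdirTemp("", "fixtures-*")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(fixtures)
	if err := os.WriteFile(filepath.Join(fixtures, "input.txt"), []byte("original"), 0o644); err != nil {
		return "", "", err
	}

	work, err := newWorkspace(fixtures)
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(work)

	// The task is free to change its own copy.
	if err := os.WriteFile(filepath.Join(work, "input.txt"), []byte("changed"), 0o644); err != nil {
		return "", "", err
	}

	inFixture, err := os.ReadFile(filepath.Join(fixtures, "input.txt"))
	if err != nil {
		return "", "", err
	}
	inWork, err := os.ReadFile(filepath.Join(work, "input.txt"))
	if err != nil {
		return "", "", err
	}
	return string(inFixture), string(inWork), nil
}

func main() {
	fixture, workspace, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Printf("fixture=%q workspace=%q\n", fixture, workspace)
}
```

Copying rather than linking keeps the sketch simple and guarantees that a task cannot reach back into the fixture through a shared file; a real runner might choose a cheaper strategy for large fixtures.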
Key Features
Structured Benchmarks
Define test cases in YAML:

    name: code-explainer-eval
    tasks:
      - "tasks/*.yaml"

    graders:
      - type: text
        config:
          regex_match: ["function", "parameter"]

11 Built-In Validators
- Code — Python assertions
- Text — Text matching
- File — Output file checking
- Diff — Diff comparison
- JSON Schema — JSON structure validation
- Prompt — LLM-powered evaluation
- Behavior — Agent behavior validation
- Action Sequence — Tool call sequence validation
- Skill Invocation — Skill invocation validation
- Program — Program execution validation
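
As an illustration of the Text validator, the regex_match setting from the benchmark example could be evaluated along these lines. This is a minimal sketch that assumes every listed pattern must match the agent output at least once; gradeRegexMatch is a hypothetical name, and Waza's actual matching semantics may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// gradeRegexMatch checks output against each pattern and returns whether
// all patterns matched, plus the list of patterns that did not match.
// (Assumed semantics for illustration; not the real grader.)
func gradeRegexMatch(output string, patterns []string) (bool, []string, error) {
	var missing []string
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return false, nil, fmt.Errorf("invalid pattern %q: %w", p, err)
		}
		if !re.MatchString(output) {
			missing = append(missing, p)
		}
	}
	return len(missing) == 0, missing, nil
}

func main() {
	patterns := []string{"function", "parameter"}
	passed, missing, err := gradeRegexMatch("This function takes one parameter.", patterns)
	if err != nil {
		panic(err)
	}
	fmt.Println("passed:", passed, "missing:", missing)
}
```

Returning the unmatched patterns, rather than only a pass/fail flag, makes a failed grade easier to diagnose from a transcript.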
Multi-Model Comparison

    waza run eval.yaml --model gpt-4o -o gpt4.json
    waza run eval.yaml --model claude-sonnet-4.6 -o sonnet.json
    waza compare gpt4.json sonnet.json

Interactive Dashboard

    waza serve

Explore results, trends, and comparisons in a web interface.
CI/CD Ready
Pre-configured GitHub Actions workflow:

    waza init my-project
    # Creates .github/workflows/eval.yml

Contributing
We welcome contributions! Here’s how to get involved:
Report Issues
Found a bug or have a feature request?
- Check existing issues
- Open a new issue with clear reproduction steps
- Include error logs and environment details
Contribute Code
- Fork the repository
- Create a branch for your feature:

      git checkout -b feature/my-feature

- Make changes following the code style
- Write tests for new functionality
- Run linter and tests:

      make lint
      make test

- Commit with clear messages:

      feat: Add my feature

- Push to your fork and open a pull request
Development Setup

    # Clone repository
    git clone https://github.com/microsoft/waza.git
    cd waza

    # Set up Go environment
    go version  # Requires 1.26+

    # Build
    make build

    # Test
    make test

    # Lint
    make lint

    # Run
    ./waza --help

Code Style
- Follow Go idioms (Effective Go)
- Use gofmt for formatting
- Include unit tests for new code
- Document public functions
Adding Validators
To add a new grader type:
- Implement the Validator interface in internal/scoring/
- Register it in ValidatorRegistry
- Add tests
- Document it in the README
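
The first two steps above might look roughly like the following self-contained sketch. Everything here is an assumption made for illustration: GradeContext and ValidationResult stand in for the real types in the models package (their fields are invented), the map-based registry is only one plausible shape for ValidatorRegistry, and Register, Lookup, and NonEmptyValidator are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// GradeContext and ValidationResult are simplified stand-ins for the
// types in the models package; the fields are assumptions.
type GradeContext struct {
	Output string
}

type ValidationResult struct {
	Passed  bool
	Message string
}

// Validator mirrors the Grade method signature used by grader types.
type Validator interface {
	Grade(ctx *GradeContext) (*ValidationResult, error)
}

// registry maps a grader type name to a constructor. This is a guess at
// what a validator registry could look like, not the real one.
var registry = map[string]func() Validator{}

// Register makes a grader type available under the given name.
func Register(name string, factory func() Validator) {
	registry[name] = factory
}

// Lookup returns a new validator for the given grader type name.
func Lookup(name string) (Validator, error) {
	factory, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unknown grader type %q", name)
	}
	return factory(), nil
}

// NonEmptyValidator passes when the output contains non-whitespace text.
type NonEmptyValidator struct{}

func (v *NonEmptyValidator) Grade(ctx *GradeContext) (*ValidationResult, error) {
	if strings.TrimSpace(ctx.Output) == "" {
		return &ValidationResult{Passed: false, Message: "output is empty"}, nil
	}
	return &ValidationResult{Passed: true, Message: "output is non-empty"}, nil
}

func main() {
	Register("non_empty", func() Validator { return &NonEmptyValidator{} })

	v, err := Lookup("non_empty")
	if err != nil {
		panic(err)
	}
	res, err := v.Grade(&GradeContext{Output: "hello"})
	if err != nil {
		panic(err)
	}
	fmt.Println(res.Passed, res.Message)
}
```

Registering a constructor instead of a shared instance lets each task get its own validator value, which avoids state leaking between graded tasks.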
Example:

    type MyValidator struct {
        Config interface{} `json:"config"`
    }

    func (v *MyValidator) Grade(ctx *models.GradeContext) (*models.ValidationResult, error) {
        // Implementation
    }

Documentation
- Update relevant docs when changing behavior
- Add examples for new features
- Keep README.md and GUIDE.md in sync
Community
Discussions
Have questions or ideas? Join the conversation:
- GitHub Discussions — Ask questions, share ideas
- GitHub Issues — Report bugs, request features
Support
Roadmap
E1: Go CLI Foundation (✅ Complete)
- ✅ waza run — Execute benchmarks
- ✅ waza init — Scaffold projects
- ✅ waza new — Create skills
- ✅ waza compare — Cross-model comparison
- ✅ All 11 grader types
E2: Sensei Engine (🟡 In Progress)
- ✅ waza check — Compliance scoring
- ✅ waza dev — Iterative improvement
- 🟡 Token budget optimization
- 🟡 Trigger accuracy testing
E3: Evaluation Framework (🟡 In Progress)
- ✅ Multi-model testing
- ✅ Comprehensive metrics
- 🟡 Statistical analysis
- 🟡 LLM-powered suggestions
E4: Token Management (⏳ Planned)
- Token counting across models
- Budget enforcement
- Optimization recommendations
E5: Waza Skill (⏳ Planned)
- Conversational skill interface
- Interactive development
E6: CI/CD Integration (✅ Complete)
- ✅ GitHub Actions workflow
- ✅ Artifact handling
- ✅ PR comments
E7: AZD Extension (✅ Complete)
- ✅ Azure Developer CLI integration
- ✅ Registry publishing
License
Author
Shayne Boyer (@spboyer)
Maintained by the waza team and community contributors.
Questions? Open an issue or start a discussion.
Inspiration
The waterfall timeline visualization in the waza dashboard was inspired by the .NET Aspire dashboard, which provides a similar trace/span view for observing distributed applications.