
Quick Start

Get from zero to running your first evaluation benchmark in 5 minutes.

Prerequisites:

  • Go 1.26 or later (for binary install)
  • GitHub Copilot access (for copilot login)

Install Waza with the install script, then confirm the install:

curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
waza --version
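
If `waza --version` reports "command not found", a PATH check usually narrows the problem down. The following is a minimal sketch, assuming the install script places a `waza` binary somewhere on your PATH; `require_cmd` is an illustrative helper name, not part of Waza.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a command is available on PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Check the two CLIs this guide uses; keep going even if one is missing.
require_cmd waza || echo "waza is not on PATH: re-run the install script or check your PATH"
require_cmd copilot || echo "copilot is not on PATH: it is needed for the login step"
```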

Waza needs GitHub Copilot access for running evaluations:

copilot login

This opens your browser to authenticate. After login, you’re ready to go.

Initialize a project and create a skill:

mkdir my-eval-suite
cd my-eval-suite
waza init
waza new skill my-skill

You’ll see:

✓ Created skill: skills/my-skill/
├── skill.yaml        # Skill definition
├── evals/
│   └── eval.yaml     # Evaluation spec
└── fixtures/
    ├── input.txt     # Sample task input
    └── README.md     # How to add more fixtures
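
To confirm the scaffold landed where you expect, you could check for the files listed in the tree. This is a sketch only: `check_scaffold` is a hypothetical helper, and the file list is taken from the output shown here, which may differ between Waza versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: confirm the scaffold files exist for a given skill.
# Run it from the project root (the directory that contains skills/).
check_scaffold() {
  local root="skills/$1" missing=0 path
  for path in skill.yaml evals/eval.yaml fixtures/input.txt fixtures/README.md; do
    if [ ! -e "$root/$path" ]; then
      echo "missing: $root/$path" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   check_scaffold my-skill && echo "scaffold looks complete"
```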

Open skills/my-skill/evals/eval.yaml and replace its contents with this minimal spec:

name: my-skill-eval
description: Test my skill
config:
  model: claude-sonnet-4.6
  timeout_seconds: 30
graders:
  - type: text
    name: has_response
    config:
      pattern: "\\w+"
tasks:
  - name: test-task-1
    description: Simple test
    input: "Hello, world!"
    expected: "Should say hello"
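
It helps to know what the `has_response` pattern accepts before you run it. In a double-quoted YAML string, `"\\w+"` decodes to the regex `\w+` (one or more word characters). The sketch below assumes the text grader treats `pattern` as a regular expression matched anywhere in the response; that behavior is an assumption, not something this guide states. `matches_pattern` is an illustrative name, and `[[:alnum:]_]+` is used as the portable POSIX spelling of `\w+`.

```shell
#!/usr/bin/env bash
# Assumption: the grader passes when the pattern matches anywhere in the
# agent's response. This helper mimics that check with grep.
matches_pattern() {
  # $1 = response text; succeeds if it contains at least one word character.
  printf '%s\n' "$1" | grep -Eq '[[:alnum:]_]+'
}

# Try a normal reply, an empty reply, and a punctuation-only reply.
for response in "Hello! How can I help you?" "" "?!..."; do
  if matches_pattern "$response"; then
    echo "PASS: '$response'"
  else
    echo "FAIL: '$response'"
  fi
done
```

Only the first response passes. Under the same assumption, a pattern this loose accepts almost any non-empty reply, so it works as a smoke test rather than a check of answer quality.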

Run the evaluation:

waza run skills/my-skill/evals/eval.yaml -v

You’ll see live execution:

Running evaluation: my-skill-eval
──────────────────────────────────
Task: test-task-1
Prompt: Hello, world!
Agent Response:
Hello! I'm an AI assistant. How can I help you?
Grading...
✓ has_response [PASS]
Task Summary:
Passed: 1/1
Score: 100%
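
If you want a script to react to the result, one option is to read the summary line from the output. This is a rough sketch that assumes the `Passed: N/M` line appears exactly as in the sample output here; the format may differ between versions, and `all_passed` is a hypothetical helper, not a Waza command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read run output on stdin and succeed only if the
# last "Passed: N/M" line reports every task passing.
all_passed() {
  local line passed total
  line=$(grep -Eo 'Passed: [0-9]+/[0-9]+' | tail -n 1 || true)
  [ -n "$line" ] || return 2   # no summary line found
  passed=${line#Passed: }      # "1/1"
  passed=${passed%/*}          # "1"
  total=${line##*/}            # "1"
  [ "$passed" -eq "$total" ]
}

# Demo against the sample summary from this guide.
sample='Task Summary:
Passed: 1/1
Score: 100%'
if printf '%s\n' "$sample" | all_passed; then
  echo "all tasks passed"
fi

# With Waza installed, the same helper could be fed real output:
#   waza run skills/my-skill/evals/eval.yaml -v | all_passed
```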

Serve the interactive dashboard:

waza serve

Open your browser to http://localhost:3000 — you’ll see:

  • Dashboard — overview of all runs
  • Run Details — task-by-task breakdown with pass/fail
  • Scoring — individual grader results and weights
  • Trends — historical performance across runs

Workflow: Create Skill (waza new skill) → Write Eval YAML (eval.yaml) → Run Evaluation (waza run eval.yaml) → View Results (waza serve) → Dashboard (localhost:3000)
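
The same workflow can be sketched as one shell function that chains the commands from this guide. `quickstart` and its arguments are illustrative names, and the sketch assumes that `waza init` and `waza new skill` run without prompting and that the generated eval.yaml is runnable as-is; neither is confirmed here.

```shell
#!/usr/bin/env bash
# Sketch: the quick-start steps as one function. The body runs in a
# subshell, so the caller's working directory is left unchanged.
quickstart() (
  dir="${1:-my-eval-suite}"    # project directory
  skill="${2:-my-skill}"       # skill name
  mkdir -p "$dir" && cd "$dir" || exit 1
  waza init &&
    waza new skill "$skill" &&
    waza run "skills/$skill/evals/eval.yaml" -v
)

# Usage:
#   quickstart my-eval-suite my-skill
```

`waza serve` is left out on purpose: it keeps running to serve the dashboard, so it is easier to start it by hand from the project directory once the run finishes.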

Stuck? Open an issue on GitHub.