
Quick Start

Get from zero to running your first evaluation benchmark in 5 minutes.

Before you start, you need:

  • GitHub Copilot access for the default provider, or the environment variables for a custom Copilot SDK provider.
  • Go 1.26 or later and Git LFS, but only if you install from source.

Install Waza with the Bash install script, then check the installed version:

curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
waza --version

The Bash installer detects the environment where Bash is running. From PowerShell, bash may resolve to WSL and install the Linux binary inside WSL.
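
To see which platform a script launched with `bash` will observe, you can print the kernel name and CPU architecture before running the installer. This is only a sketch: the installer's actual detection logic is not shown here, and `detect_platform` is a hypothetical helper name. `uname` is simply a common way to perform this kind of check.

```shell
# Hypothetical helper: print the kernel name and architecture that the
# current shell reports, e.g. a Linux value when bash resolves to WSL.
detect_platform() {
  printf '%s/%s\n' "$(uname -s)" "$(uname -m)"
}

detect_platform
```

If you run this through `bash` from PowerShell and the kernel name is `Linux`, Bash is running inside WSL, so the installer would place the Linux binary there.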

By default, Waza uses the copilot-sdk executor and needs GitHub Copilot access for running evaluations:

copilot login

This opens your browser to authenticate. After login, you’re ready to go.

Initialize a project and create a skill:

mkdir my-eval-suite
cd my-eval-suite
waza init
waza new skill my-skill

You’ll see:

✓ Created skill: skills/my-skill/
├── skill.yaml # Skill definition
├── evals/
│   └── eval.yaml # Evaluation spec
└── fixtures/
    ├── input.txt # Sample task input
    └── README.md # How to add more fixtures

Open skills/my-skill/evals/eval.yaml and replace its contents with the minimal spec below. If your task needs files in the sandbox, list them under inputs.files; --context-dir is only the base directory used to look up those paths.

name: my-skill-eval
description: Test my skill
config:
  model: claude-sonnet-4.6
  timeout_seconds: 30
graders:
  - type: text
    name: has_response
    config:
      contains:
        - "How can I help"
tasks:
  - name: test-task-1
    description: Simple test
    input: "Hello, world!"
    expected: "Should say hello"
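
To build intuition for the `has_response` grader, the check can be pictured as a substring test on the agent's response. The sketch below assumes that `contains` means a literal, case-sensitive substring match and that every listed string must appear. Neither assumption is confirmed by this page, and `contains_all` is a hypothetical helper, not part of Waza.

```shell
# Hypothetical helper: succeed only if every expected substring
# appears literally in the response text.
contains_all() {
  response=$1
  shift
  for needle in "$@"; do
    case "$response" in
      *"$needle"*) ;;   # quoted pattern, so the needle is matched literally
      *) return 1 ;;
    esac
  done
  return 0
}

response="Hello! I'm an AI assistant. How can I help you?"
if contains_all "$response" "How can I help"; then
  echo "has_response: PASS"
else
  echo "has_response: FAIL"
fi
```

Under these assumptions, a response that greets the user without the phrase "How can I help" would fail the grader, so pick strings that a correct answer is very likely to include.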

Run the evaluation:

waza run skills/my-skill/evals/eval.yaml -v

You’ll see live execution:

Running evaluation: my-skill-eval
──────────────────────────────────
Task: test-task-1
Prompt: Hello, world!
Agent Response:
Hello! I'm an AI assistant. How can I help you?
Grading...
✓ has_response [PASS]
Task Summary:
Passed: 1/1
Score: 100%

Serve the interactive dashboard:

waza serve

Open your browser to http://localhost:3000 — you’ll see:

  • Dashboard — overview of all runs
  • Run Details — task-by-task breakdown with pass/fail
  • Scoring — individual grader results and weights
  • Trends — historical performance across runs
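
If the page does not load right away, a small polling loop can confirm when the dashboard starts answering. This is a sketch, not a Waza feature: `wait_for_url` is a hypothetical helper, and it assumes `curl` is installed and that the dashboard answers plain HTTP requests at the address shown above.

```shell
# Hypothetical helper: poll a URL until it responds or attempts run out.
# Usage: wait_for_url URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_url() {
  url=$1
  attempts=${2:-10}
  delay=${3:-1}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f: treat HTTP errors as failure, -s: quiet, --max-time: cap each try
    if curl -fs -o /dev/null --max-time 2 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

With `waza serve` running in another terminal, `wait_for_url http://localhost:3000 && echo "dashboard is up"` prints the message once the server responds.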

graph LR
    A["Create Skill<br/>waza new skill"] --> B["Write Eval YAML<br/>eval.yaml"]
    B --> C["Run Evaluation<br/>waza run eval.yaml"]
    C --> D["View Results<br/>waza serve"]
    D --> E["Dashboard<br/>localhost:3000"]
    style A fill:#e1f5ff
    style B fill:#f3e5f5
    style C fill:#e8f5e9
    style D fill:#fff3e0
    style E fill:#fce4ec

Stuck? Open an issue on GitHub.