Quick Start

Get from zero to running your first evaluation benchmark in 5 minutes.

Prerequisites

GitHub Copilot access for the default provider, or environment variables configured for a custom Copilot SDK provider.
Go 1.26 or later and Git LFS, needed only if you install from source.

1. Install

Choose one method:

Install script (Bash):

curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
waza --version

The Bash installer detects the environment where Bash is running. From PowerShell, bash may resolve to WSL and install the Linux binary inside WSL.
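
If you are unsure where bash resolves on your machine, a quick check before piping the installer into it can save a misplaced install. The sketch below is not part of Waza; the function name detect_bash_env is hypothetical, and it relies on the common convention that WSL kernels report "microsoft" in /proc/version.

```shell
# Report which environment this shell process is running in.
# detect_bash_env is a hypothetical helper, not a Waza command.
detect_bash_env() {
  case "$(uname -s)" in
    Linux)
      # WSL kernels conventionally include "microsoft" in /proc/version.
      if grep -qi microsoft /proc/version 2>/dev/null; then
        echo "wsl"
      else
        echo "linux"
      fi
      ;;
    Darwin) echo "macos" ;;
    MINGW*|MSYS*|CYGWIN*) echo "windows-bash" ;;
    *) echo "unknown" ;;
  esac
}

echo "bash is running in: $(detect_bash_env)"
```

If this reports wsl when you expected a native Windows install, use the Windows binary download instead.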

Windows binary:

Download waza-windows-amd64.exe or waza-windows-arm64.exe from the newest stable vX.Y.Z CLI release on the Releases page, rename it to waza.exe, and place it in a directory on your PATH.

waza --version

From source:

git clone https://github.com/microsoft/waza.git
cd waza
git lfs install
git lfs pull
go build -o waza ./cmd/waza
./waza --version
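
Before building from source, it may help to confirm that the prerequisites are actually on your PATH. This is a minimal sketch, not a Waza feature; missing_tools is a hypothetical helper, and it assumes Git LFS is installed as the usual git-lfs binary.

```shell
# Print every tool from the argument list that is not found on PATH.
# missing_tools is a hypothetical helper, not a Waza command.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

missing="$(missing_tools git go git-lfs)"
if [ -n "$missing" ]; then
  # Left unquoted on purpose so the names print on one line.
  echo "Missing prerequisites:" $missing
else
  go version        # compare against the Go 1.26 minimum by eye
  git lfs version
fi
```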

azd extension:

azd ext source add -n waza -t url -l https://raw.githubusercontent.com/microsoft/waza/main/registry.json
azd ext install microsoft.azd.waza
azd waza --version

All commands below use waza. If you installed the azd extension, replace waza with azd waza.

2. Authenticate

By default, Waza uses the copilot-sdk executor and needs GitHub Copilot access for running evaluations:

copilot login

This opens your browser to authenticate. After login, you’re ready to go.

Waza bundles the GitHub Copilot CLI used by the copilot-sdk executor and extracts it to the local user cache on first use. Set COPILOT_CLI_PATH only when you need to force a specific Copilot CLI binary.
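
If you do need to override the bundled CLI, one cautious approach is to export COPILOT_CLI_PATH only after checking that the target is an executable file. The sketch below assumes a hypothetical binary location ($HOME/bin/copilot) and a hypothetical helper name (use_copilot_cli); only the COPILOT_CLI_PATH variable itself comes from Waza.

```shell
# Export COPILOT_CLI_PATH only when the candidate is an executable file.
# use_copilot_cli is a hypothetical helper, not a Waza command.
use_copilot_cli() {
  candidate="$1"
  if [ -x "$candidate" ] && [ ! -d "$candidate" ]; then
    export COPILOT_CLI_PATH="$candidate"
    echo "using Copilot CLI at $candidate"
  else
    echo "not an executable file: $candidate" >&2
    return 1
  fi
}

# Hypothetical location; replace it with the binary you want to force.
use_copilot_cli "$HOME/bin/copilot" || echo "keeping the bundled Copilot CLI"
```

Leaving the variable unset keeps the default behavior described above.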

3. Create Your First Skill

Initialize a project and create a skill:

mkdir my-eval-suite
cd my-eval-suite
waza init
waza new skill my-skill

You’ll see:

✓ Created skill: skills/my-skill/
├── skill.yaml          # Skill definition
├── evals/
│   └── eval.yaml       # Evaluation spec
└── fixtures/
    ├── input.txt       # Sample task input
    └── README.md       # How to add more fixtures

4. Write Your First Eval

Open skills/my-skill/evals/eval.yaml. If your task needs files in the sandbox, list them under inputs.files; --context-dir is only the lookup base for those paths.

Replace the file's contents with this minimal spec:

name: my-skill-eval
description: Test my skill
config:
  model: claude-sonnet-4.6
  timeout_seconds: 30

graders:
  - type: text
    name: has_response
    config:
      contains:
        - "How can I help"

tasks:
  - name: test-task-1
    description: Simple test
    input: "Hello, world!"
    expected: "Should say hello"
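
To get a feel for what the has_response grader is checking, the sketch below mimics a contains check in plain shell. It assumes plain, case-sensitive substring matching, which this page does not confirm; contains_all is a hypothetical helper, and Waza's real grader may apply different rules.

```shell
# Succeed only if the response contains every required substring.
# Assumption: plain, case-sensitive substring matching.
# contains_all is a hypothetical helper, not Waza's grader.
contains_all() {
  response="$1"
  shift
  for needle in "$@"; do
    case "$response" in
      *"$needle"*) ;;      # this substring is present, check the next one
      *) return 1 ;;       # a required substring is missing
    esac
  done
  return 0
}

# Sample response text; a real model reply may be worded differently.
if contains_all "Hello! I'm an AI assistant. How can I help you?" "How can I help"; then
  echo "has_response: PASS"
else
  echo "has_response: FAIL"
fi
```

Because a model's wording varies between runs, a check like this fails whenever the reply omits the exact phrase, so choose the contains strings with that in mind.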

5. Run It

waza run skills/my-skill/evals/eval.yaml -v

You’ll see live execution:

Running evaluation: my-skill-eval
──────────────────────────────────

Task: test-task-1
Prompt: Hello, world!

Agent Response:
Hello! I'm an AI assistant. How can I help you?

Grading...
✓ has_response [PASS]

Task Summary:
  Passed: 1/1
  Score: 100%

6. View Results

Serve the interactive dashboard:

waza serve

Open your browser to http://localhost:3000 — you’ll see:

Dashboard — overview of all runs
Run Details — task-by-task breakdown with pass/fail
Scoring — individual grader results and weights
Trends — historical performance across runs
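
If the page does not load, you can check from a second terminal whether anything is answering on that address. This is a minimal sketch using curl, not a Waza command; dashboard_status is a hypothetical helper, and the URL is the default shown above.

```shell
# Print "up" if the URL answers an HTTP request successfully, otherwise "down".
# dashboard_status is a hypothetical helper, not a Waza command.
dashboard_status() {
  url="${1:-http://localhost:3000}"
  if curl -fsS -o /dev/null --max-time 2 "$url" 2>/dev/null; then
    echo "up"
  else
    echo "down"
  fi
}

echo "dashboard is $(dashboard_status)"
```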

Workflow Diagram

graph LR
    A["Create Skill<br/>waza new skill"] --> B["Write Eval YAML<br/>eval.yaml"]
    B --> C["Run Evaluation<br/>waza run eval.yaml"]
    C --> D["View Results<br/>waza serve"]
    D --> E["Dashboard<br/>localhost:3000"]
    style A fill:#e1f5ff
    style B fill:#f3e5f5
    style C fill:#e8f5e9
    style D fill:#fff3e0
    style E fill:#fce4ec

Next Steps

Getting Started — Complete reference with project structure and workflow
Eval YAML Reference — Full spec for writing eval files
Validators & Graders — All 11 grader types with examples
Web Dashboard Guide — Features and navigation
Evaluating Custom Agents — Evaluate .agent.md files with automatic tool constraint validation
CI/CD Integration — Automate evaluations in GitHub Actions

Stuck? Open an issue on GitHub.