Quick Start
Get from zero to running your first evaluation benchmark in 5 minutes.
Prerequisites
- GitHub Copilot access for the default provider, or custom Copilot SDK provider environment variables.
- Go 1.26 or later and Git LFS (needed only when installing from source).
1. Install
Choose one method:

Install script (Bash):

    curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
    waza --version

The Bash installer detects the environment where Bash is running. From PowerShell, bash may resolve to WSL and install the Linux binary inside WSL.

Windows binary:

Download waza-windows-amd64.exe or waza-windows-arm64.exe from the latest release, rename it to waza.exe, and place it in a directory on your PATH.

    waza --version

From source:

    git clone https://github.com/microsoft/waza.git
    cd waza
    git lfs install
    git lfs pull
    go build -o waza ./cmd/waza
    ./waza --version

azd extension:

    azd ext source add -n waza -t url -l https://raw.githubusercontent.com/microsoft/waza/main/registry.json
    azd ext install microsoft.azd.waza
    azd waza --version

All commands below use waza; with the azd extension, replace waza with azd waza.
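If the version check fails, or you suspect the binary landed in a different environment (for example inside WSL), it can help to see where the command resolves in the current shell. The following is a minimal sketch, not part of Waza: check_tools is a hypothetical helper name, and it relies only on the standard command -v lookup.

```shell
#!/bin/sh
# Hypothetical pre-flight helper (not shipped with waza):
# reports where each named command resolves in the current shell.
check_tools() {
  ct_missing=0
  for ct_tool in "$@"; do
    if ct_path=$(command -v "$ct_tool" 2>/dev/null); then
      printf 'found   %s -> %s\n' "$ct_tool" "$ct_path"
    else
      printf 'missing %s\n' "$ct_tool"
      ct_missing=1
    fi
  done
  # Non-zero status if at least one command was not found.
  return "$ct_missing"
}

check_tools waza || echo "waza is not on PATH in this shell"
```

Run it in the same shell you plan to use for the rest of this guide, since PATH can differ between PowerShell, WSL, and other shells.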
2. Authenticate
By default, Waza uses the copilot-sdk executor and needs GitHub Copilot access for running evaluations:

    copilot login

This opens your browser to authenticate. After login, you’re ready to go.
3. Create Your First Skill
Initialize a project and create a skill:

    mkdir my-eval-suite
    cd my-eval-suite
    waza init
    waza new skill my-skill

You’ll see:

    ✓ Created skill: skills/my-skill/
    ├── skill.yaml          # Skill definition
    ├── evals/
    │   └── eval.yaml       # Evaluation spec
    └── fixtures/
        ├── input.txt       # Sample task input
        └── README.md       # How to add more fixtures

4. Write Your First Eval
Open skills/my-skill/evals/eval.yaml and modify it to this minimal spec:

    name: my-skill-eval
    description: Test my skill
    config:
      model: claude-sonnet-4.6
      timeout_seconds: 30

    graders:
      - type: text
        name: has_response
        config:
          contains:
            - "How can I help"

    tasks:
      - name: test-task-1
        description: Simple test
        input: "Hello, world!"
        expected: "Should say hello"

If your task needs files in the sandbox, list them under inputs.files; --context-dir is only the lookup base for those paths.

5. Run It
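Before running, it can help to picture what the has_response grader from step 4 checks. The sketch below is an illustration only and not Waza's implementation: contains_all is a hypothetical function, and it assumes the contains list means every listed string must appear verbatim (case-sensitive) in the agent response.

```shell
#!/bin/sh
# Hypothetical stand-in for a text grader with a `contains` list.
# Assumption: the check passes only if every listed string appears
# verbatim (case-sensitive) somewhere in the response.
contains_all() {
  ca_response=$1
  shift
  for ca_needle in "$@"; do
    case $ca_response in
      *"$ca_needle"*) ;;   # this string was found, keep checking
      *) return 1 ;;       # a single miss fails the whole check
    esac
  done
  return 0
}

response="Hello! I'm an AI assistant. How can I help you?"
if contains_all "$response" "How can I help"; then
  echo "has_response [PASS]"
else
  echo "has_response [FAIL]"
fi
```

Under that assumption, a response that greets the user with different wording would fail this check, so keep the expected strings short and likely to appear.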
    waza run skills/my-skill/evals/eval.yaml -v

You’ll see live execution:

    Running evaluation: my-skill-eval
    ──────────────────────────────────

    Task: test-task-1
    Prompt: Hello, world!

    Agent Response:
    Hello! I'm an AI assistant. How can I help you?

    Grading...
    ✓ has_response [PASS]

    Task Summary:
      Passed: 1/1
      Score: 100%

6. View Results
Serve the interactive dashboard:

    waza serve

Open your browser to http://localhost:3000. You’ll see:
- Dashboard — overview of all runs
- Run Details — task-by-task breakdown with pass/fail
- Scoring — individual grader results and weights
- Trends — historical performance across runs
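The pass/fail counts in these views correspond to the task summary printed in step 5. As a rough illustration of how such a summary can be derived, the sketch below counts PASS results and reports a percentage. It is an assumption-laden example, not Waza's scoring code: summarize is a hypothetical helper, the score is assumed to be the plain share of passed tasks, and grader weights are ignored.

```shell
#!/bin/sh
# Hypothetical summary helper (not part of waza).
# Assumption: score = passed tasks / total tasks, with no grader weights.
summarize() {
  sm_total=$#
  sm_passed=0
  if [ "$sm_total" -eq 0 ]; then
    echo "No task results"
    return 1
  fi
  for sm_result in "$@"; do
    if [ "$sm_result" = "PASS" ]; then
      sm_passed=$((sm_passed + 1))
    fi
  done
  # Integer percentage; the fractional part is truncated.
  printf 'Passed: %d/%d Score: %d%%\n' "$sm_passed" "$sm_total" \
    "$((sm_passed * 100 / sm_total))"
}

summarize PASS   # prints "Passed: 1/1 Score: 100%"
```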
Workflow Diagram
    graph LR
        A["Create Skill<br/>waza new skill"] --> B["Write Eval YAML<br/>eval.yaml"]
        B --> C["Run Evaluation<br/>waza run eval.yaml"]
        C --> D["View Results<br/>waza serve"]
        D --> E["Dashboard<br/>localhost:3000"]
        style A fill:#e1f5ff
        style B fill:#f3e5f5
        style C fill:#e8f5e9
        style D fill:#fff3e0
        style E fill:#fce4ec

Next Steps
- Getting Started: Complete reference with project structure and workflow
- Eval YAML Reference: Full spec for writing eval files
- Validators & Graders: All 11 grader types with examples
- Web Dashboard Guide: Features and navigation
- Evaluating Custom Agents: Evaluate .agent.md files with automatic tool constraint validation
- CI/CD Integration: Automate evaluations in GitHub Actions
Stuck? Open an issue on GitHub.