Skip to content

CI/CD Integration

Automate waza evaluations in your CI/CD pipeline to catch regressions early and validate agent behavior across models.

If your CI environment supports the Azure Developer CLI, use the waza azd extension:

# GitHub Actions example
- uses: Azure/setup-azd@v2
- run: |
azd config set alpha.extensions on
azd ext source add -n waza -t url -l https://raw.githubusercontent.com/microsoft/waza/main/registry.json
azd ext install microsoft.azd.waza
- run: azd waza run ./evals/my-skill/eval.yaml -v

Download the latest waza binary for your platform:

Terminal window
# Linux / macOS
curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
export PATH="$HOME/bin:$PATH"

For Windows, download waza.exe from the latest GitHub release assets and add its directory to PATH.

Terminal window
git clone https://github.com/microsoft/waza.git
cd waza
go build -o waza ./cmd/waza

Quick Start ΓÇö Scaffold with waza init

Section titled “Quick Start ΓÇö Scaffold with waza init”

The waza init command generates a ready-to-use workflow:

Terminal window
waza init my-project

This creates .github/workflows/eval.yml with sensible defaults.

Complete Example: Multi-Model Testing on PR

Section titled “Complete Example: Multi-Model Testing on PR”
name: Evaluate Skills
on:
push:
branches: [main, develop]
pull_request:
branches: [main]
paths:
- 'evals/**'
- 'skills/**'
- '.github/workflows/eval.yml'
permissions:
contents: read
pull-requests: write
jobs:
evaluate:
runs-on: ubuntu-latest
strategy:
matrix:
model: [gpt-4o, claude-sonnet-4.6, claude-opus-4]
max-parallel: 3
timeout-minutes: 30
steps:
- uses: actions/checkout@v4
- name: Install waza
run: |
curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
echo "$HOME/bin" >> $GITHUB_PATH
- name: Run evaluations (${{ matrix.model }})
run: |
waza run \
--model "${{ matrix.model }}" \
--output "results-${{ matrix.model }}.json" \
--verbose
- name: Upload results
uses: actions/upload-artifact@v4
with:
name: results-${{ matrix.model }}
path: results-${{ matrix.model }}.json
retention-days: 30
- name: Comment PR with results
if: github.event_name == 'pull_request'
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
const results = JSON.parse(fs.readFileSync('results-${{ matrix.model }}.json', 'utf8'));
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: `### Eval Results: ${{ matrix.model }}\n\n✅ **Passed:** ${results.summary.succeeded}\n❌ **Failed:** ${results.summary.failed}`
});

Use waza tokens compare to track token usage across PRs and fail if budgets are exceeded:

- name: Check token budget
run: |
waza tokens compare origin/main HEAD \
--format table \
--strict

This compares token counts between your main branch and current branch. Exit code is non-zero if any limit is exceeded with --strict.

Run evaluations only when relevant files change:

on:
pull_request:
branches: [main]
paths:
- 'evals/**' # Eval config changed
- 'skills/**' # Skill code changed
- 'examples/**' # Examples changed
- '.github/workflows/eval.yml' # Workflow itself changed
jobs:
evaluate:
runs-on: ubuntu-latest

Cache fixture data to speed up repeated evaluations:

- name: Restore fixture cache
uses: actions/cache@v4
with:
path: .waza-cache
key: waza-cache-${{ hashFiles('examples/**/fixtures/**') }}
restore-keys: waza-cache-
- name: Run with cache
run: |
waza run \
--cache \
--cache-dir .waza-cache \
--verbose
trigger:
branches:
include:
- main
- develop
paths:
include:
- evals/**
- skills/**
- azure-pipelines.yml
pr:
branches:
include:
- main
pool:
vmImage: 'ubuntu-latest'
jobs:
- job: EvaluateSkills
displayName: Run Waza Evaluations
timeoutInMinutes: 30
strategy:
matrix:
GPT4:
MODEL: 'gpt-4o'
Claude:
MODEL: 'claude-sonnet-4.6'
steps:
- checkout: self
fetchDepth: 0 # Full history for diffs
- script: |
curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
echo "##vso[task.prependpath]$HOME/bin"
displayName: 'Install waza'
- script: |
waza run \
--model "$(MODEL)" \
--output "results-$(MODEL).json" \
--verbose
displayName: 'Run evaluations ($(MODEL))'
env:
GITHUB_TOKEN: $(System.AccessToken)
- task: PublishBuildArtifacts@1
inputs:
pathToPublish: 'results-$(MODEL).json'
artifactName: 'eval-results-$(MODEL)'
publishLocation: 'Container'
condition: always()
- script: |
echo "##vso[task.logissue type=error;]Evaluations failed"
exit 1
condition: failed()
displayName: 'Report failure'
- script: |
waza tokens compare origin/main HEAD \
--format table \
--strict
displayName: 'Check token budget'

GitLab CI is highly portable and works anywhere your code lives:

stages:
- evaluate
evaluate:
stage: evaluate
image: ubuntu:latest
timeout: 30m
parallel:
matrix:
- MODEL: gpt-4o
- MODEL: claude-sonnet-4.6
- MODEL: claude-opus-4
before_script:
- apt-get update && apt-get install -y curl ca-certificates
- curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
- export PATH="$HOME/bin:$PATH"
script:
- waza run --model $MODEL --output results-$MODEL.json --verbose
artifacts:
paths:
- results-$MODEL.json
expire_in: 30 days
when: always
only:
variables:
- $CI_COMMIT_BRANCH == "main" || $CI_COMMIT_BRANCH == "develop"
changes:
- evals/**
- skills/**
- .gitlab-ci.yml
retry:
max: 2
when: script_failure

Avoid running expensive evaluations on every commit. Use path filters:

on:
pull_request:
paths:
- 'evals/**' # Only if eval files changed
- 'skills/**' # Or if skill code changed

Test across models in parallel to catch model-specific issues:

strategy:
matrix:
model: [gpt-4o, claude-sonnet-4.6]
max-parallel: 2 # Limit concurrency to avoid rate limits

Keep results long enough to analyze trends but not forever:

retention-days: 30 # 30 days for typical workflows

Set reasonable task timeouts to avoid hanging CI jobs:

# In eval.yaml
config:
timeout_seconds: 300 # 5 minutes per task
# In CI workflow
timeout-minutes: 30 # 30 minutes total

Stop CI early on critical evaluation failures:

# In eval.yaml
config:
fail_fast: true # Stop on first failure

Prevent models from exceeding token budgets in PRs:

- name: Check token budget
run: waza tokens compare origin/main HEAD --format table --strict

Cache fixture data across runs, but invalidate on eval config changes:

- uses: actions/cache@v4
with:
path: .waza-cache
key: waza-cache-${{ hashFiles('evals/**/*.yaml') }}

Never hardcode API keys. Use CI secrets:

- name: Run evaluations
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: waza run -v

Ensure the binary is in your PATH after installation:

Terminal window
# GitHub Actions
echo "$HOME/bin" >> $GITHUB_PATH
# Azure DevOps
echo "##vso[task.prependpath]$HOME/bin"
# GitLab CI (in before_script)
export PATH="$HOME/bin:$PATH"

Increase the timeout in both your CI workflow and eval config:

# CI level: GitHub Actions
timeout-minutes: 45
# Eval config level
config:
timeout_seconds: 600 # 10 minutes per task

Use waza tokens check to audit token consumption before running full evaluations:

Terminal window
waza tokens check ./skills/my-skill/

Reduce parallel workers and add rate-limit awareness:

config:
parallel: false # Run tasks sequentially

Or use --workers 1 in your CI command.

Use the baseline flag in your eval spec to establish a baseline, then compare new runs:

baseline-eval.yaml
baseline: true
config:
model: claude-sonnet-4.6

Then in CI, compare with:

Terminal window
waza compare baseline-results.json candidate-results.json

Multi-Stage Pipeline: PR → Staging → Production

Section titled “Multi-Stage Pipeline: PR ΓåÆ Staging ΓåÆ Production”
# GitHub Actions example
on:
pull_request:
branches: [main]
push:
branches: [main]
jobs:
pr-eval:
name: "Fast eval (PR)"
if: github.event_name == 'pull_request'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: |
curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
$HOME/bin/waza run --tags "smoke" -v
full-eval:
name: "Full eval (after merge)"
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: |
curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
$HOME/bin/waza run -v