# CI/CD Integration
Automate waza evaluations in your CI/CD pipeline to catch regressions early and validate agent behavior across models.
## Installation in CI

### Via azd Extension (Recommended)

If your CI environment supports the Azure Developer CLI, use the waza azd extension:

```yaml
# GitHub Actions example
- uses: Azure/setup-azd@v2
- run: |
    azd config set alpha.extensions on
    azd ext source add -n waza -t url -l https://raw.githubusercontent.com/microsoft/waza/main/registry.json
    azd ext install microsoft.azd.waza
- run: azd waza run ./evals/my-skill/eval.yaml -v
```

### Via Binary

Download the latest waza binary for your platform:

```sh
# Linux / macOS
curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
export PATH="$HOME/bin:$PATH"
```

For Windows, download `waza.exe` from the latest GitHub release assets and add its directory to `PATH`.
### Via Source (if Go is available)

```sh
git clone https://github.com/microsoft/waza.git
cd waza
go build -o waza ./cmd/waza
```

## GitHub Actions
Section titled “GitHub Actions”Quick Start ΓÇö Scaffold with waza init
Section titled “Quick Start ΓÇö Scaffold with waza init”The waza init command generates a ready-to-use workflow:
waza init my-projectThis creates .github/workflows/eval.yml with sensible defaults.
### Complete Example: Multi-Model Testing on PR

```yaml
name: Evaluate Skills

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
    paths:
      - 'evals/**'
      - 'skills/**'
      - '.github/workflows/eval.yml'

permissions:
  contents: read
  pull-requests: write

jobs:
  evaluate:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        model: [gpt-4o, claude-sonnet-4.6, claude-opus-4]
      max-parallel: 3
    timeout-minutes: 30

    steps:
      - uses: actions/checkout@v4

      - name: Install waza
        run: |
          curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
          echo "$HOME/bin" >> $GITHUB_PATH

      - name: Run evaluations (${{ matrix.model }})
        run: |
          waza run \
            --model "${{ matrix.model }}" \
            --output "results-${{ matrix.model }}.json" \
            --verbose

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: results-${{ matrix.model }}
          path: results-${{ matrix.model }}.json
          retention-days: 30

      - name: Comment PR with results
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('results-${{ matrix.model }}.json', 'utf8'));
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `### Eval Results: ${{ matrix.model }}\n\n✅ **Passed:** ${results.summary.succeeded}\n❌ **Failed:** ${results.summary.failed}`
            });
```
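The PR comment reads pass/fail counts out of the results JSON. The same summary can gate a job directly without posting a comment; a minimal sketch, assuming the `{"summary": {"succeeded": N, "failed": M}}` shape used by the comment step (the sample file below is a stand-in for a real `results-<model>.json`):

```shell
# Stand-in for a real results file produced by `waza run --output ...`.
cat > results-sample.json <<'EOF'
{"summary": {"succeeded": 12, "failed": 0}}
EOF

# Extract the failure count with sed to avoid a jq dependency on minimal runners.
failed=$(sed -n 's/.*"failed":[[:space:]]*\([0-9][0-9]*\).*/\1/p' results-sample.json)

if [ "$failed" -gt 0 ]; then
  echo "evals failed: $failed"
  exit 1
fi
echo "all evals passed"
```

A step like this lets the job's exit status reflect eval results even on CI systems without a PR-comment integration.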
### Token Budget Checks

Use `waza tokens compare` to track token usage across PRs and fail if budgets are exceeded:

```yaml
- name: Check token budget
  run: |
    waza tokens compare origin/main HEAD \
      --format table \
      --strict
```

This compares token counts between your main branch and your current branch. With `--strict`, the command exits non-zero if any limit is exceeded.
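Comparing against `origin/main` requires that ref to exist in the CI checkout. On GitHub Actions, `actions/checkout` fetches only a single commit by default, so a deeper fetch is a safe assumption before running the comparison:

```yaml
- uses: actions/checkout@v4
  with:
    fetch-depth: 0  # fetch full history so origin/main is available for the diff
```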
### Smart Path Filtering

Run evaluations only when relevant files change:

```yaml
on:
  pull_request:
    branches: [main]
    paths:
      - 'evals/**'                    # Eval config changed
      - 'skills/**'                   # Skill code changed
      - 'examples/**'                 # Examples changed
      - '.github/workflows/eval.yml'  # Workflow itself changed

jobs:
  evaluate:
    runs-on: ubuntu-latest
```
### Caching Fixtures and Results

Cache fixture data to speed up repeated evaluations:

```yaml
- name: Restore fixture cache
  uses: actions/cache@v4
  with:
    path: .waza-cache
    key: waza-cache-${{ hashFiles('examples/**/fixtures/**') }}
    restore-keys: waza-cache-

- name: Run with cache
  run: |
    waza run \
      --cache \
      --cache-dir .waza-cache \
      --verbose
```
## Azure DevOps Pipelines

### Basic Pipeline YAML

```yaml
trigger:
  branches:
    include:
      - main
      - develop
  paths:
    include:
      - evals/**
      - skills/**
      - azure-pipelines.yml

pr:
  branches:
    include:
      - main

pool:
  vmImage: 'ubuntu-latest'

jobs:
  - job: EvaluateSkills
    displayName: Run Waza Evaluations
    timeoutInMinutes: 30
    strategy:
      matrix:
        GPT4:
          MODEL: 'gpt-4o'
        Claude:
          MODEL: 'claude-sonnet-4.6'

    steps:
      - checkout: self
        fetchDepth: 0  # Full history for diffs

      - script: |
          curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
          echo "##vso[task.prependpath]$HOME/bin"
        displayName: 'Install waza'

      - script: |
          waza run \
            --model "$(MODEL)" \
            --output "results-$(MODEL).json" \
            --verbose
        displayName: 'Run evaluations ($(MODEL))'
        env:
          GITHUB_TOKEN: $(System.AccessToken)

      - task: PublishBuildArtifacts@1
        inputs:
          pathToPublish: 'results-$(MODEL).json'
          artifactName: 'eval-results-$(MODEL)'
          publishLocation: 'Container'
        condition: always()

      - script: |
          echo "##vso[task.logissue type=error;]Evaluations failed"
          exit 1
        condition: failed()
        displayName: 'Report failure'
```
### Token Budget in Azure DevOps

```yaml
- script: |
    waza tokens compare origin/main HEAD \
      --format table \
      --strict
  displayName: 'Check token budget'
```

## GitLab CI
### Complete Pipeline

The pipeline below runs unchanged on GitLab.com or a self-managed GitLab instance:

```yaml
stages:
  - evaluate

evaluate:
  stage: evaluate
  image: ubuntu:latest
  timeout: 30m
  parallel:
    matrix:
      - MODEL: gpt-4o
      - MODEL: claude-sonnet-4.6
      - MODEL: claude-opus-4

  before_script:
    - apt-get update && apt-get install -y curl ca-certificates
    - curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
    - export PATH="$HOME/bin:$PATH"

  script:
    - waza run --model $MODEL --output results-$MODEL.json --verbose

  artifacts:
    paths:
      - results-$MODEL.json
    expire_in: 30 days
    when: always

  only:
    variables:
      - $CI_COMMIT_BRANCH == "main" || $CI_COMMIT_BRANCH == "develop"
    changes:
      - evals/**
      - skills/**
      - .gitlab-ci.yml

  retry:
    max: 2
    when: script_failure
```
## Best Practices

### 1. Run on Specific Triggers

Avoid running expensive evaluations on every commit. Use path filters:

```yaml
on:
  pull_request:
    paths:
      - 'evals/**'   # Only if eval files changed
      - 'skills/**'  # Or if skill code changed
```
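Stacked pushes to the same PR can still queue redundant runs even with path filters. On GitHub Actions, a `concurrency` group (an addition here, not from the waza docs) cancels superseded runs:

```yaml
concurrency:
  group: eval-${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true  # cancel an in-flight run when a newer push arrives
```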
### 2. Matrix Testing for Multiple Models

Test across models in parallel to catch model-specific issues:

```yaml
strategy:
  matrix:
    model: [gpt-4o, claude-sonnet-4.6]
  max-parallel: 2  # Limit concurrency to avoid rate limits
```
### 3. Artifact Retention

Keep results long enough to analyze trends but not forever:

```yaml
retention-days: 30  # 30 days for typical workflows
```
### 4. Timeout Configuration

Set reasonable task timeouts to avoid hanging CI jobs:

```yaml
# In eval.yaml
config:
  timeout_seconds: 300  # 5 minutes per task
```

```yaml
# In CI workflow
timeout-minutes: 30  # 30 minutes total
```
### 5. Fail Fast on Critical Tests

Stop CI early on critical evaluation failures:

```yaml
# In eval.yaml
config:
  fail_fast: true  # Stop on first failure
```
### 6. Token Budget Enforcement

Prevent models from exceeding token budgets in PRs:

```yaml
- name: Check token budget
  run: waza tokens compare origin/main HEAD --format table --strict
```
### 7. Cache Results Intelligently

Cache fixture data across runs, but invalidate on eval config changes:

```yaml
- uses: actions/cache@v4
  with:
    path: .waza-cache
    key: waza-cache-${{ hashFiles('evals/**/*.yaml') }}
```
### 8. Secure Secrets

Never hardcode API keys. Use CI secrets:

```yaml
# GitHub Actions
- name: Run evaluations
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: waza run -v
```

```yaml
# Azure DevOps
- script: waza run -v
  env:
    OPENAI_API_KEY: $(OPENAI_API_KEY)
```

```yaml
# GitLab CI
script:
  - waza run -v
variables:
  OPENAI_API_KEY: $OPENAI_API_KEY
```
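A missing secret usually surfaces as an opaque auth error partway through a run. A small preflight step catches it earlier; this is a sketch only (the placeholder assignment exists so the snippet runs standalone — in CI, the secret store injects the real value):

```shell
# Illustrative placeholder so this snippet runs outside CI; remove in a real pipeline.
OPENAI_API_KEY="${OPENAI_API_KEY:-placeholder-for-illustration}"

# Fail fast with a clear message instead of a confusing mid-run auth failure.
# Checks presence only; never echo the value itself into CI logs.
if [ -z "$OPENAI_API_KEY" ]; then
  echo "error: OPENAI_API_KEY is not set" >&2
  exit 1
fi
echo "OPENAI_API_KEY is set"
```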
## Troubleshooting

### “waza: command not found”

Ensure the binary is in your `PATH` after installation:

```sh
# GitHub Actions
echo "$HOME/bin" >> $GITHUB_PATH
```

```sh
# Azure DevOps
echo "##vso[task.prependpath]$HOME/bin"
```

```sh
# GitLab CI (in before_script)
export PATH="$HOME/bin:$PATH"
```
### Evaluations Timeout

Increase the timeout in both your CI workflow and your eval config:

```yaml
# CI level: GitHub Actions
timeout-minutes: 45
```

```yaml
# Eval config level
config:
  timeout_seconds: 600  # 10 minutes per task
```
### High Token Usage

Use `waza tokens check` to audit token consumption before running full evaluations:

```sh
waza tokens check ./skills/my-skill/
```
### Rate Limiting from API Calls

Reduce parallel workers and add rate-limit awareness:

```yaml
config:
  parallel: false  # Run tasks sequentially
```

Or use `--workers 1` in your CI command.
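If rate-limit errors persist even when running sequentially, a retry wrapper with backoff around the CI command is a common mitigation. A minimal POSIX-shell sketch (the `retry` helper is our own, not part of waza; substitute your real command, e.g. `waza run --workers 1 -v`, for the `true` placeholder):

```shell
# Retry a command up to N times with linear backoff between attempts.
retry() {
  attempts=$1
  shift
  i=1
  while true; do
    "$@" && return 0                      # success: stop retrying
    [ "$i" -ge "$attempts" ] && return 1  # out of attempts: propagate failure
    sleep "$i"                            # linear backoff: 1s, 2s, ...
    i=$((i + 1))
  done
}

# Placeholder command; in CI this would be e.g.: retry 3 waza run --workers 1 -v
retry 3 true && echo "command succeeded"
```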
## Advanced Workflows

### Comparing Models with Baseline

Use the `baseline` flag in your eval spec to establish a baseline, then compare new runs:

```yaml
baseline: true
config:
  model: claude-sonnet-4.6
```

Then in CI, compare with:

```sh
waza compare baseline-results.json candidate-results.json
```
### Multi-Stage Pipeline: PR → Staging → Production

```yaml
# GitHub Actions example
on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  pr-eval:
    name: "Fast eval (PR)"
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: |
          curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
          $HOME/bin/waza run --tags "smoke" -v

  full-eval:
    name: "Full eval (after merge)"
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: |
          curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
          $HOME/bin/waza run -v
```
## Next Steps

- Writing Eval Specs - Configure your evaluations
- CLI Reference - All flags and commands
- Graders Guide - Set up scoring rules
- Examples - Real eval setups