# CI/CD Integration
Automate waza evaluations in your CI/CD pipeline to catch regressions early and validate agent behavior across models.
## Installation in CI

### Via azd Extension (Recommended)

If your CI environment supports the Azure Developer CLI, use the waza azd extension:

```yaml
# GitHub Actions example
- uses: Azure/setup-azd@v2
- run: |
    azd config set alpha.extensions on
    azd ext source add -n waza -t url -l https://raw.githubusercontent.com/microsoft/waza/main/registry.json
    azd ext install microsoft.azd.waza
- run: azd waza run ./evals/my-skill/eval.yaml -v
```

### Via Binary

Download the latest waza binary for your platform:

```sh
# Linux / macOS
curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
export PATH="$HOME/bin:$PATH"
```

For Windows, download `waza.exe` from the latest GitHub release assets and add its directory to `PATH`.
### Via Source (if Go is available)

```sh
git clone https://github.com/microsoft/waza.git
cd waza
go build -o waza ./cmd/waza
```

## GitHub Actions
Section titled “GitHub Actions”Quick Start ΓÇö Scaffold with waza init
Section titled “Quick Start ΓÇö Scaffold with waza init”The waza init command generates a ready-to-use workflow:
waza init my-projectThis creates .github/workflows/eval.yml with sensible defaults.
### Complete Example: Multi-Model Testing on PR

```yaml
name: Evaluate Skills

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
    paths:
      - 'evals/**'
      - 'skills/**'
      - '.github/workflows/eval.yml'

permissions:
  contents: read
  pull-requests: write

jobs:
  evaluate:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        model: [gpt-4o, claude-sonnet-4.6, claude-opus-4]
      max-parallel: 3
    timeout-minutes: 30

    steps:
      - uses: actions/checkout@v4

      - name: Install waza
        run: |
          curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
          echo "$HOME/bin" >> $GITHUB_PATH

      - name: Run evaluations (${{ matrix.model }})
        run: |
          waza run \
            --model "${{ matrix.model }}" \
            --output "results-${{ matrix.model }}.json" \
            --verbose

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: results-${{ matrix.model }}
          path: results-${{ matrix.model }}.json
          retention-days: 30

      - name: Comment PR with results
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('results-${{ matrix.model }}.json', 'utf8'));
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `### Eval Results: ${{ matrix.model }}\n\n✅ **Passed:** ${results.summary.succeeded}\n❌ **Failed:** ${results.summary.failed}`
            });
```
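The PR comment reads pass/fail counts out of the results JSON. The same summary can gate a job directly without posting a comment; a minimal sketch, assuming the `{"summary": {"succeeded": N, "failed": M}}` shape used by the comment step (the sample file below is a stand-in for a real `results-<model>.json`):

```shell
# Stand-in for a real results file produced by `waza run --output ...`.
cat > results-sample.json <<'EOF'
{"summary": {"succeeded": 12, "failed": 0}}
EOF

# Extract the failure count with sed to avoid a jq dependency on minimal runners.
failed=$(sed -n 's/.*"failed":[[:space:]]*\([0-9][0-9]*\).*/\1/p' results-sample.json)

if [ "$failed" -gt 0 ]; then
  echo "evals failed: $failed"
  exit 1
fi
echo "all evals passed"
```

A step like this lets the job's exit status reflect eval results even on CI systems without a PR-comment integration.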
### Token Budget Checks

Use `waza tokens compare` to track token usage across PRs and fail if budgets are exceeded:

```yaml
- name: Check token budget
  run: |
    waza tokens compare origin/main HEAD \
      --format table \
      --strict
```

This compares token counts between your main branch and your current branch. With `--strict`, the command exits non-zero if any limit is exceeded.
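Comparing against `origin/main` requires that ref to exist in the CI checkout. On GitHub Actions, `actions/checkout` fetches only a single commit by default, so a deeper fetch is a safe assumption before running the comparison:

```yaml
- uses: actions/checkout@v4
  with:
    fetch-depth: 0  # fetch full history so origin/main is available for the diff
```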
### Smart Path Filtering

Run evaluations only when relevant files change:

```yaml
on:
  pull_request:
    branches: [main]
    paths:
      - 'evals/**'                    # Eval config changed
      - 'skills/**'                   # Skill code changed
      - 'examples/**'                 # Examples changed
      - '.github/workflows/eval.yml'  # Workflow itself changed

jobs:
  evaluate:
    runs-on: ubuntu-latest
```
### Caching Fixtures and Results

Cache fixture data to speed up repeated evaluations:

```yaml
- name: Restore fixture cache
  uses: actions/cache@v4
  with:
    path: .waza-cache
    key: waza-cache-${{ hashFiles('examples/**/fixtures/**') }}
    restore-keys: waza-cache-

- name: Run with cache
  run: |
    waza run \
      --cache \
      --cache-dir .waza-cache \
      --verbose
```
## Azure DevOps Pipelines

### Basic Pipeline YAML

```yaml
trigger:
  branches:
    include:
      - main
      - develop
  paths:
    include:
      - evals/**
      - skills/**
      - azure-pipelines.yml

pr:
  branches:
    include:
      - main

pool:
  vmImage: 'ubuntu-latest'

jobs:
  - job: EvaluateSkills
    displayName: Run Waza Evaluations
    timeoutInMinutes: 30
    strategy:
      matrix:
        GPT4:
          MODEL: 'gpt-4o'
        Claude:
          MODEL: 'claude-sonnet-4.6'

    steps:
      - checkout: self
        fetchDepth: 0  # Full history for diffs

      - script: |
          curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
          echo "##vso[task.prependpath]$HOME/bin"
        displayName: 'Install waza'

      - script: |
          waza run \
            --model "$(MODEL)" \
            --output "results-$(MODEL).json" \
            --verbose
        displayName: 'Run evaluations ($(MODEL))'
        env:
          GITHUB_TOKEN: $(System.AccessToken)

      - task: PublishBuildArtifacts@1
        inputs:
          pathToPublish: 'results-$(MODEL).json'
          artifactName: 'eval-results-$(MODEL)'
          publishLocation: 'Container'
        condition: always()

      - script: |
          echo "##vso[task.logissue type=error;]Evaluations failed"
          exit 1
        condition: failed()
        displayName: 'Report failure'
```
### Token Budget in Azure DevOps

```yaml
- script: |
    waza tokens compare origin/main HEAD \
      --format table \
      --strict
  displayName: 'Check token budget'
```

## GitLab CI
### Complete Pipeline

The pipeline below runs unchanged on GitLab.com or a self-managed GitLab instance:

```yaml
stages:
  - evaluate

evaluate:
  stage: evaluate
  image: ubuntu:latest
  timeout: 30m
  parallel:
    matrix:
      - MODEL: gpt-4o
      - MODEL: claude-sonnet-4.6
      - MODEL: claude-opus-4

  before_script:
    - apt-get update && apt-get install -y curl ca-certificates
    - curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
    - export PATH="$HOME/bin:$PATH"

  script:
    - waza run --model $MODEL --output results-$MODEL.json --verbose

  artifacts:
    paths:
      - results-$MODEL.json
    expire_in: 30 days
    when: always

  only:
    variables:
      - $CI_COMMIT_BRANCH == "main" || $CI_COMMIT_BRANCH == "develop"
    changes:
      - evals/**
      - skills/**
      - .gitlab-ci.yml

  retry:
    max: 2
    when: script_failure
```
## Best Practices

### 1. Run on Specific Triggers

Avoid running expensive evaluations on every commit. Use path filters:

```yaml
on:
  pull_request:
    paths:
      - 'evals/**'   # Only if eval files changed
      - 'skills/**'  # Or if skill code changed
```
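Stacked pushes to the same PR can still queue redundant runs even with path filters. On GitHub Actions, a `concurrency` group (an addition here, not from the waza docs) cancels superseded runs:

```yaml
concurrency:
  group: eval-${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true  # cancel an in-flight run when a newer push arrives
```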
### 2. Matrix Testing for Multiple Models

Test across models in parallel to catch model-specific issues:

```yaml
strategy:
  matrix:
    model: [gpt-4o, claude-sonnet-4.6]
  max-parallel: 2  # Limit concurrency to avoid rate limits
```
### 3. Artifact Retention

Keep results long enough to analyze trends but not forever:

```yaml
retention-days: 30  # 30 days for typical workflows
```
### 4. Timeout Configuration

Set reasonable task timeouts to avoid hanging CI jobs:

```yaml
# In eval.yaml
config:
  timeout_seconds: 300  # 5 minutes per task
```

```yaml
# In CI workflow
timeout-minutes: 30  # 30 minutes total
```
### 5. Fail Fast on Critical Tests

Stop CI early on critical evaluation failures:

```yaml
# In eval.yaml
config:
  fail_fast: true  # Stop on first failure
```
### 6. Token Budget Enforcement

Prevent models from exceeding token budgets in PRs:

```yaml
- name: Check token budget
  run: waza tokens compare origin/main HEAD --format table --strict
```
### 7. Cache Results Intelligently

Cache fixture data across runs, but invalidate on eval config changes:

```yaml
- uses: actions/cache@v4
  with:
    path: .waza-cache
    key: waza-cache-${{ hashFiles('evals/**/*.yaml') }}
```
### 8. Secure Secrets

Never hardcode API keys. Use CI secrets:

```yaml
# GitHub Actions
- name: Run evaluations
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: waza run -v
```

```yaml
# Azure DevOps
- script: waza run -v
  env:
    OPENAI_API_KEY: $(OPENAI_API_KEY)
```

```yaml
# GitLab CI
script:
  - waza run -v
variables:
  OPENAI_API_KEY: $OPENAI_API_KEY
```
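A missing secret usually surfaces as an opaque auth error partway through a run. A small preflight step catches it earlier; this is a sketch only (the placeholder assignment exists so the snippet runs standalone — in CI, the secret store injects the real value):

```shell
# Illustrative placeholder so this snippet runs outside CI; remove in a real pipeline.
OPENAI_API_KEY="${OPENAI_API_KEY:-placeholder-for-illustration}"

# Fail fast with a clear message instead of a confusing mid-run auth failure.
# Checks presence only; never echo the value itself into CI logs.
if [ -z "$OPENAI_API_KEY" ]; then
  echo "error: OPENAI_API_KEY is not set" >&2
  exit 1
fi
echo "OPENAI_API_KEY is set"
```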
## Troubleshooting

### “waza: command not found”

Ensure the binary is in your `PATH` after installation:

```sh
# GitHub Actions
echo "$HOME/bin" >> $GITHUB_PATH
```

```sh
# Azure DevOps
echo "##vso[task.prependpath]$HOME/bin"
```

```sh
# GitLab CI (in before_script)
export PATH="$HOME/bin:$PATH"
```
### Evaluations Timeout

Increase the timeout in both your CI workflow and your eval config:

```yaml
# CI level: GitHub Actions
timeout-minutes: 45
```

```yaml
# Eval config level
config:
  timeout_seconds: 600  # 10 minutes per task
```
### High Token Usage

Use `waza tokens check` to audit token consumption before running full evaluations:

```sh
waza tokens check ./skills/my-skill/
```
### Rate Limiting from API Calls

Reduce parallel workers and add rate-limit awareness:

```yaml
config:
  parallel: false  # Run tasks sequentially
```

Or use `--workers 1` in your CI command.
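If rate-limit errors persist even when running sequentially, a retry wrapper with backoff around the CI command is a common mitigation. A minimal POSIX-shell sketch (the `retry` helper is our own, not part of waza; substitute your real command, e.g. `waza run --workers 1 -v`, for the `true` placeholder):

```shell
# Retry a command up to N times with linear backoff between attempts.
retry() {
  attempts=$1
  shift
  i=1
  while true; do
    "$@" && return 0                      # success: stop retrying
    [ "$i" -ge "$attempts" ] && return 1  # out of attempts: propagate failure
    sleep "$i"                            # linear backoff: 1s, 2s, ...
    i=$((i + 1))
  done
}

# Placeholder command; in CI this would be e.g.: retry 3 waza run --workers 1 -v
retry 3 true && echo "command succeeded"
```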
## Advanced Workflows

### Comparing Models with Baseline

Use the `baseline` flag in your eval spec to establish a baseline, then compare new runs:

```yaml
baseline: true
config:
  model: claude-sonnet-4.6
```

Then in CI, compare with:

```sh
waza compare baseline-results.json candidate-results.json
```
### Multi-Stage Pipeline: PR → Staging → Production

```yaml
# GitHub Actions example
on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  pr-eval:
    name: "Fast eval (PR)"
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: |
          curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
          $HOME/bin/waza run --tags "smoke" -v

  full-eval:
    name: "Full eval (after merge)"
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: |
          curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
          $HOME/bin/waza run -v
```
## Next Steps

- Writing Eval Specs - Configure your evaluations
- CLI Reference - All flags and commands
- Graders Guide - Set up scoring rules
- Examples - Real eval setups