
CI/CD Integration

Integrate waza evaluations into your GitHub Actions CI/CD pipeline.

Waza scaffolds a ready-to-use workflow with `waza init`:

```shell
waza init my-project
```

This creates `.github/workflows/eval.yml`:

```yaml
name: Evaluation

on:
  push:
    branches: [main]
  pull_request:

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install waza
        run: |
          curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
          echo "$HOME/bin" >> $GITHUB_PATH

      - name: Run evaluations
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: waza run -v -o results.json

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results.json
```

Alternatively, create your own workflow in `.github/workflows/eval.yml`:

```yaml
name: Run Evaluations

on:
  push:
    branches: [main, develop]
  pull_request:
    types: [opened, synchronize]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        model: [gpt-4o, claude-sonnet-4.6]
    steps:
      - uses: actions/checkout@v4

      - name: Install waza
        run: |
          curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
          echo "$HOME/bin" >> $GITHUB_PATH

      - name: Run evals with ${{ matrix.model }}
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          waza run \
            --model "${{ matrix.model }}" \
            -o "results-${{ matrix.model }}.json" \
            -v

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: results-${{ matrix.model }}
          path: results-${{ matrix.model }}.json
```

Test across multiple models in parallel:

```yaml
strategy:
  matrix:
    model:
      - gpt-4o
      - claude-sonnet-4.6
      - claude-opus-4
  max-parallel: 3

steps:
  - name: Run evals for ${{ matrix.model }}
    run: waza run --model "${{ matrix.model }}" -o "results-${{ matrix.model }}.json"
```

Run a subset of tasks in CI to save time:

```yaml
- name: Run fast tests
  run: waza run --tags "smoke" -v

- name: Run comprehensive tests (nightly)
  if: github.event_name == 'schedule'
  run: waza run -v
```

Run tasks in parallel:

```yaml
- name: Run evaluations in parallel
  run: waza run --parallel --workers 8 -v
```

Save results for later analysis:

```yaml
- name: Upload evaluation results
  uses: actions/upload-artifact@v4
  with:
    name: eval-results-${{ github.run_id }}
    path: results.json
    retention-days: 30
```

Download the results locally and view them in the dashboard:

```shell
gh run download <run-id> -n eval-results-<run-id>
waza serve
```

Post results as a GitHub comment:

```yaml
- name: Run evaluations
  run: waza run -o results.json --format github-comment > comment.md

- name: Post comment
  uses: actions/github-script@v7
  with:
    script: |
      const fs = require('fs');
      const comment = fs.readFileSync('comment.md', 'utf8');
      github.rest.issues.createComment({
        issue_number: context.issue.number,
        owner: context.repo.owner,
        repo: context.repo.repo,
        body: comment
      });
```
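Posting a fresh comment on every push clutters the pull request. A common variant is to tag the comment with a hidden marker and update it in place on subsequent runs. The sketch below uses the standard octokit methods `issues.listComments` and `issues.updateComment`; the `<!-- waza-eval -->` marker is an arbitrary string chosen here, not a waza convention:

```yaml
- name: Post or update comment
  uses: actions/github-script@v7
  with:
    script: |
      const fs = require('fs');
      const marker = '<!-- waza-eval -->';
      const body = marker + '\n' + fs.readFileSync('comment.md', 'utf8');
      // Look for a previous eval comment carrying the marker
      const { data: comments } = await github.rest.issues.listComments({
        owner: context.repo.owner,
        repo: context.repo.repo,
        issue_number: context.issue.number,
      });
      const existing = comments.find(c => c.body && c.body.includes(marker));
      if (existing) {
        // Update in place instead of adding a new comment
        await github.rest.issues.updateComment({
          owner: context.repo.owner,
          repo: context.repo.repo,
          comment_id: existing.id,
          body,
        });
      } else {
        await github.rest.issues.createComment({
          owner: context.repo.owner,
          repo: context.repo.repo,
          issue_number: context.issue.number,
          body,
        });
      }
```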

Run evaluations only when specific conditions are met:

```yaml
- name: Check if eval config changed
  id: check
  run: |
    if git diff-tree --no-commit-id --name-only -r HEAD | grep -q "evals/"; then
      echo "EVAL_CHANGED=true" >> $GITHUB_OUTPUT
    fi

- name: Run evaluations
  if: steps.check.outputs.EVAL_CHANGED == 'true' || github.event_name == 'workflow_dispatch'
  run: waza run -v
```

Cache evaluation results to speed up repeated runs:

```yaml
- name: Cache waza results
  uses: actions/cache@v3
  with:
    path: .waza-cache
    key: waza-cache-${{ hashFiles('evals/**') }}
    restore-keys: waza-cache-

- name: Run evaluations with cache
  run: waza run --cache --cache-dir .waza-cache -v
```

Run evaluations on a schedule (e.g., daily):

```yaml
on:
  schedule:
    - cron: '0 0 * * *' # Daily at midnight UTC
```

Pass credentials safely via environment variables:

```yaml
- name: Run evaluations
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    API_KEY: ${{ secrets.API_KEY }}
  run: waza run -v
```

Create `.github/workflows/eval.yml` in your repository.

If you use `github.rest.issues.createComment`, grant the workflow write permissions:

```yaml
permissions:
  contents: read
  issues: write
  pull-requests: write
```

GitHub Actions automatically provides GITHUB_TOKEN. For other APIs, add secrets in repository settings.
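A `run:` step can verify that required secrets are present before starting an evaluation, so a missing secret fails fast with a clear message instead of partway through a long run. This is a sketch; `check_secret` is a helper defined here, and `API_KEY` is a placeholder for whatever your provider requires:

```shell
# check_secret NAME - prints "ok: NAME" if the environment variable
# NAME is set and non-empty; otherwise prints "missing secret: NAME"
# and returns 1 so the step fails.
check_secret() {
  name="$1"
  # POSIX sh has no ${!name}, so use eval for indirect expansion
  eval "value=\${$name:-}"
  if [ -z "$value" ]; then
    echo "missing secret: $name"
    return 1
  fi
  echo "ok: $name"
}

# In a workflow step this would run with secrets mapped into env:
#   check_secret GITHUB_TOKEN && check_secret API_KEY
```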

Test the workflow locally with act:

```shell
# Install act
brew install act

# Run the push workflow
act push
```

Fail the workflow if evaluations fail:

```yaml
- name: Run evaluations
  run: waza run -v
  # waza exits with code 1 if any task fails, which fails the workflow
```
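If you want failures recorded without blocking the pipeline, a `run:` step can capture the exit code explicitly. In this sketch `false` stands in for `waza run -v` so the snippet is self-contained:

```shell
# Capture the command's exit code instead of letting the step abort
# immediately (this pattern is safe even under `set -e`).
status=0
false || status=$?   # replace `false` with: waza run -v

if [ "$status" -ne 0 ]; then
  # Surface the failure as a GitHub Actions warning annotation
  echo "::warning::evaluations failed with exit code $status"
fi
```

GitHub Actions also offers `continue-on-error: true` on a step for the same effect without a shell wrapper.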
Best practices:

  1. Cache fixtures: speed up repeated runs.
  2. Matrix testing: test multiple models in parallel.
  3. Artifact retention: keep results for dashboard analysis.
  4. Conditional runs: skip unnecessary evaluations.
  5. Timeouts: set reasonable timeouts to catch hanging tasks.
  6. Notify on failure: post comments or create issues.

Ensure `$HOME/bin` is in `PATH`:

```yaml
- name: Install waza
  run: |
    curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
    echo "$HOME/bin" >> $GITHUB_PATH
```

Check the working directory:

```yaml
- name: Run from correct directory
  run: |
    ls -la
    waza run evals/my-skill/eval.yaml -v
```

Increase the timeout or use `--parallel`:

```yaml
- name: Run with longer timeout
  run: waza run --config.timeout_seconds 600 -v
```
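Independently of waza's own timeout, GitHub Actions can cap a hanging step with the built-in `timeout-minutes` setting, which cancels the step when the limit is reached:

```yaml
- name: Run evaluations
  timeout-minutes: 30
  run: waza run -v
```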