
# CLI Commands

Complete reference for all waza CLI commands and their options.

## Installation

```sh
curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
waza --version
```

## waza run

Run an evaluation benchmark.

```sh
waza run [eval.yaml | skill-name]
```
| Argument | Description |
| --- | --- |
| `[eval.yaml]` | Path to evaluation spec file |
| `[skill-name]` | Skill name (auto-detects `eval.yaml`) |
| (none) | Auto-detect using workspace detection |
| Flag | Short | Type | Default | Description |
| --- | --- | --- | --- | --- |
| `--context-dir` | `-c` | string | `./fixtures` | Fixtures directory path |
| `--output` | `-o` | string | | Save results JSON to file |
| `--output-dir` | `-d` | string | | Directory for structured output; each run creates a UTC-timestamped subdirectory of the specified directory. Mutually exclusive with `--output`. |
| `--verbose` | `-v` | bool | `false` | Detailed progress output |
| `--parallel` | | bool | `false` | Run tasks concurrently |
| `--workers` | `-w` | int | `4` | Number of concurrent workers |
| `--trials` | | int | `config.trials_per_task` | Run each task N times for flakiness detection (omit to use `config.trials_per_task`; when provided, value must be >= 1) |
| `--task` | `-t` | string | | Filter tasks by name (repeatable) |
| `--tags` | | string | | Filter tasks by tags (repeatable) |
| `--model` | `-m` | string | | Override model (repeatable) |
| `--judge-model` | | string | | Model for LLM-as-judge graders (overrides execution model) |
| `--cache` | | bool | `false` | Enable result caching |
| `--cache-dir` | | string | `.waza-cache` | Cache directory path |
| `--format` | `-f` | string | `default` | Output format: `default`, `github-comment` |
| `--reporter` | | string[] | | Output reporters: `json`, `junit:<path>` (repeatable) |
| `--timeout` | | int | `300` | Task timeout in seconds |
| `--baseline` | | bool | `false` | A/B testing mode: runs each task twice (without skill = baseline, with skill = normal) and computes improvement scores |
| `--update-snapshots` | | bool | `false` | Update or create diff grader snapshot files to match current workspace output |
| `--discover` | | string | | Auto skill discovery: walks directory tree for SKILL.md + eval.yaml (root/tests/evals) |
| `--strict` | | bool | `false` | Fail if any SKILL.md lacks eval coverage (use with `--discover`) |
| `--suggest` | | bool | `false` | Generate a Copilot suggestion report based on test outcomes |
| `--recommend` | | bool | `false` | Generate a heuristic recommendation after a multi-model run |
| `--session-log` | | bool | `false` | Enable session event logging (NDJSON) |
| `--session-dir` | | string | `.` | Directory for session log files |
| `--no-summary` | | bool | `false` | Skip combined summary.json for multi-skill runs |
| `--skip-graders` | | bool | `false` | Skip grading (execution only); grade later with `waza grade` |
| `--no-skills` | | bool | `false` | Disable all skill loading for the evaluation |
| `--transcript-dir` | | string | | Save per-task transcript JSON files |
| `--no-cache` | | bool | `false` | Explicitly disable result caching |
| `--keep-workspace` | | bool | `false` | Preserve temp workspaces after execution for debugging |
```sh
# Run all tasks
waza run eval.yaml -v

# Run specific skill
waza run code-explainer

# Specify fixtures directory
waza run eval.yaml -c ./fixtures -v

# Save results
waza run eval.yaml -o results.json

# Filter to specific tasks
waza run eval.yaml --task "basic*" --task "edge*"

# Multiple models (parallel)
waza run eval.yaml --model gpt-4o --model claude-sonnet-4.6

# Use a different judge model for LLM-as-judge graders
waza run eval.yaml --model gpt-4o --judge-model claude-opus-4.6

# Parallel execution with 8 workers
waza run eval.yaml --parallel --workers 8

# With caching
waza run eval.yaml --cache --cache-dir .waza-cache

# Generate JUnit XML for CI test reporting
waza run eval.yaml --reporter junit:results.xml

# A/B testing: baseline vs skill performance
waza run eval.yaml --baseline -o results.json
# Output includes improvement breakdown (quality, tokens, turns, time, completion)

# Auto-update diff grader snapshots
waza run eval.yaml --update-snapshots

# Auto skill discovery
waza run --discover ./skills/

# Auto discovery with strict mode (fail if any SKILL.md lacks eval coverage)
waza run --discover --strict ./skills/

# Skip grading, then grade separately
waza run eval.yaml --skip-graders -o results.json
waza grade eval.yaml --results results.json

# Session event logging
waza run eval.yaml --session-log --session-dir ./logs

# Multi-model comparison with recommendation
waza run eval.yaml --model gpt-4o --model claude-sonnet-4.6 --recommend

# Keep temp workspaces for debugging fixture issues
waza run eval.yaml --keep-workspace -v
```

## waza init

Initialize a new waza project.

```sh
waza init [directory]
```
| Argument | Description |
| --- | --- |
| `[directory]` | Project directory (default: current dir) |

| Flag | Description |
| --- | --- |
| `--no-skill` | Skip first skill creation prompt |
Generated structure:

```
project-root/
├── skills/
├── evals/
├── .github/workflows/eval.yml
├── .gitignore
└── README.md
```
```sh
waza init my-project
waza init my-project --no-skill
```

## waza new skill

Create a new skill.

```sh
waza new skill [skill-name]
```

Project mode (inside a skills/ directory):

```sh
cd my-skills-repo
waza new skill code-explainer
# Creates skills/code-explainer/SKILL.md + evals/code-explainer/
```

Standalone mode (no skills/ directory):

```sh
waza new skill my-skill
# Creates my-skill/ with all files
```

| Flag | Description |
| --- | --- |
| `--template` | Template pack (coming soon) |
```sh
# Interactive wizard
waza new skill code-analyzer

# Non-interactive (CI/CD)
waza new skill code-analyzer << EOF
Code Analyzer
Analyzes code for patterns and issues
code, analysis
EOF
```

## waza new eval

Scaffold an eval suite for an existing skill, generated from its SKILL.md.

```sh
waza new eval <skill-name>
```

Creates:

- `evals/<skill-name>/eval.yaml`
- `evals/<skill-name>/tasks/positive-trigger-1.yaml`
- `evals/<skill-name>/tasks/positive-trigger-2.yaml`
- `evals/<skill-name>/tasks/negative-trigger-1.yaml`
| Flag | Description |
| --- | --- |
| `--output` | Custom path for generated `eval.yaml`; task YAMLs are created in a sibling `tasks/` directory next to this file. |
```sh
# Generate eval scaffold using skills/my-skill/SKILL.md
waza new eval my-skill

# Write eval.yaml to a custom location
waza new eval my-skill --output evals/custom/my-skill-eval.yaml
```

## waza new task from-prompt

Generate task YAML files from a recorded prompt execution.

Run your prompt through Copilot and generate a task file with inferred validators from the recorded session.

```sh
waza new task from-prompt <prompt> <task-path>
```
| Argument | Description |
| --- | --- |
| `<prompt>` | Prompt to execute during recording |
| `<task-path>` | Output path for the generated task YAML |
| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| `--model` | string | `claude-sonnet-4.5` | Copilot model to use for recording |
| `--testname` | string | `auto-generated-test` | Test name and test ID for generated task |
| `--tags` | string[] | | Tags to add to generated task |
| `--timeout` | duration | `5m0s` | Max time to allow prompt completion |
| `--overwrite` | bool | `false` | Overwrite output file if it already exists |
| `--root` | string | `.` | Directory used to discover skills |
```sh
# Generate a task from a recorded prompt run
waza new task from-prompt "Refactor this function for readability" evals/code-explainer/tasks/refactor-readability.yaml

# Add metadata and allow file replacement
waza new task from-prompt "Explain this diff and risks" evals/code-explainer/tasks/diff-analysis.yaml \
  --testname diff-analysis \
  --tags recorded,regression \
  --overwrite
```

## waza check

Validate skill compliance and readiness.

```sh
waza check [skill-name | skill-path]
```
| Argument | Description |
| --- | --- |
| `[skill-name]` | Skill name (e.g., `code-explainer`) |
| `[skill-path]` | Path to skill directory |
| (none) | Check all skills in workspace |
| Flag | Description |
| --- | --- |
| `--verbose` | Detailed compliance report |
| `--format` | Output format: `text` (default), `json` |
Example output:

```
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: code-explainer
📋 Compliance Score: High
✅ Excellent! Your skill meets all requirements.
📊 Token Budget: 420 / 500 tokens
✅ Within budget.
🧪 Evaluation Suite: Found
✅ eval.yaml detected.
✅ Your skill is ready for submission!
```
```sh
waza check code-explainer
waza check ./skills/code-explainer
waza check --verbose
```

## waza compare

Compare evaluation results across models.

```sh
waza compare [results-1.json] [results-2.json] ...
```
| Argument | Description |
| --- | --- |
| `[results-N.json]` | Result files to compare (2+ required) |

| Flag | Description |
| --- | --- |
| `--format` | Output format: `table` (default), `json` |
```sh
waza compare gpt4.json sonnet.json
waza compare gpt4.json sonnet.json opus.json
waza compare results-*.json --format json
```

## waza coverage

Generate an eval coverage grid for discovered skills.

```sh
waza coverage [root]
```
| Argument | Description |
| --- | --- |
| `[root]` | Root directory to scan (default: current directory) |

| Flag | Description |
| --- | --- |
| `-f, --format` | Output format: `text` (default), `markdown`, `json` |
| `--path` | Additional directories to scan for skills/evals (repeatable) |
Coverage levels:

- Full: Skill has an `eval.yaml`/`eval.yml` with tasks (via `tasks:` or `tasks_from:`) and at least 2 distinct grader types.
- Partial: Skill has an `eval.yaml`/`eval.yml` but fewer than 2 grader types or no tasks.
- Missing: No `eval.yaml`/`eval.yml` found for the skill.

Note: The reported coverage percentage reflects only fully covered skills (Fully Covered / Total Skills).

```sh
waza coverage
waza coverage --format markdown
waza coverage --format json
waza coverage --path custom-evals --path plugins
```

## waza suggest

Generate suggested eval artifacts from a skill’s SKILL.md using an LLM.

```sh
waza suggest <skill-path>
```
| Flag | Description |
| --- | --- |
| `--model` | Model to use for suggestions (default: project default model) |
| `--dry-run` | Print suggestions to stdout (default) |
| `--apply` | Write suggested files to disk |
| `--output-dir` | Output directory (default: `<skill-path>/evals`) |
| `--format` | Output format: `yaml` (default), `json` |
```sh
# Preview suggestions
waza suggest skills/code-explainer --dry-run

# Write eval/task/fixture files
waza suggest skills/code-explainer --apply

# JSON output
waza suggest skills/code-explainer --format json
```

## waza tokens

Token budget management.

### waza tokens count

Count tokens in skill files.

```sh
waza tokens count [path]
```

```sh
waza tokens count skills/code-explainer/SKILL.md
waza tokens count skills/
```

### waza tokens check

Check token usage against budget.

```sh
waza tokens check [skill-name]
```

```sh
waza tokens check code-explainer
# Output:
# code-explainer: 420 / 500 tokens (84%)
# ✅ Within budget
```

Token limits are resolved in priority order: `.waza.yaml` `tokens.limits` → `token-limits.json` (deprecated; migrate to `.waza.yaml`) → built-in defaults. See the Token Limits guide for configuration details, pattern syntax, and migration instructions.

Configure per-file limits in `.waza.yaml`:

```yaml
tokens:
  limits:
    defaults:
      "*.md": 2000
      "skills/**/SKILL.md": 5000
    overrides:
      "skills/complex-skill/SKILL.md": 7500
```

### waza tokens compare

Compare markdown token counts between git refs. Supports both general-purpose file-level comparison and skill-aware comparison with CI gating.

```sh
waza tokens compare [refs...]
```

```sh
# Compare HEAD to working tree (default)
waza tokens compare

# Compare a specific ref to working tree
waza tokens compare main

# Skill-aware comparison with CI threshold
waza tokens compare main --skills --threshold 10

# JSON output with strict absolute limits
waza tokens compare main --skills --threshold 10 --strict --format json
```

Flags: `--format table|json`, `--show-unchanged`, `--strict`, `--skills`, `--threshold <percent>`

Without `--skills`, the command compares all markdown files. With `--skills`, it restricts comparison to SKILL.md files under configured skill roots (`skills/`, `.github/skills/`, and `paths.skills` from `.waza.yaml`). In skills mode the default base ref is `origin/main` (falling back to `main`).

`--threshold` sets a percentage-change gate for CI. Newly added files are exempt from threshold checks (there is no baseline to compare) but are still subject to absolute limit checks when `--strict` is set.
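The threshold gate can back a CI job. The fragment below is a hypothetical GitHub Actions step (step name and fetch command are illustrative, not part of waza); it assumes the command exits nonzero when the threshold or a strict limit is violated, which fails the build:

```yaml
# Hypothetical CI step: fail the build when SKILL.md token growth exceeds 10%.
# Assumes waza is already installed on the runner.
- name: Token budget gate
  run: |
    git fetch origin main
    waza tokens compare main --skills --threshold 10 --strict --format json
```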

### waza tokens profile

Structural analysis of SKILL.md files: reports token count, section count, code block count, and workflow step detection.

```sh
waza tokens profile [skill-name | path]
```

Flags: `--format text|json`, `--tokenizer bpe|estimate`

Example output:

```
📊 my-skill: 1,722 tokens (detailed ✓), 8 sections, 4 code blocks
⚠️ no workflow steps detected
```

Warnings are emitted when no workflow steps are detected, the file exceeds 2,500 tokens, or it has fewer than 3 sections.

### waza tokens suggest

Get optimization suggestions.

```sh
waza tokens suggest [skill-name]
```

Analyzes SKILL.md and suggests:

- Sections to shorten
- Removable content
- Restructuring opportunities

## waza results

Manage evaluation results from cloud storage or local storage.

### waza results list

List evaluation runs from configured cloud storage.

```sh
waza results list
waza results list --limit 20
waza results list --format json
```

| Flag | Description |
| --- | --- |
| `--limit <n>` | Maximum results to display (default: 10) |
| `--format` | Output format: `table` or `json` (default: `table`) |

### waza results compare

Compare two evaluation runs side by side.

```sh
waza results compare run-id-1 run-id-2
waza results compare run-id-1 run-id-2 --format json
```

| Flag | Description |
| --- | --- |
| `--format` | Output format: `table` or `json` (default: `table`) |

## waza cache

Manage the evaluation result cache.

```sh
waza cache clear [--cache-dir=.waza-cache]
```

Clear all cached evaluation results.

The cache stores test outcomes to speed up repeated evaluations with the same inputs. Cached results are keyed by spec configuration, task definition, model, and fixture file contents.

```sh
waza cache clear
waza cache clear --cache-dir /path/to/cache
```

| Flag | Description |
| --- | --- |
| `--cache-dir` | Cache directory to clear (default: `.waza-cache`) |

```sh
# Clear default cache
waza cache clear

# Clear custom cache directory
waza cache clear --cache-dir .my-cache
```

## waza dev

Improve skill compliance iteratively.

```sh
waza dev [skill-name | skill-path]
```

| Flag | Description |
| --- | --- |
| `--target` | Target level: `low`, `medium`, `high` |
| `--max-iterations` | Max improvement loops (default: 5) |
| `--auto` | Auto-apply without prompting |
| `--fast` | Skip integration tests |
```sh
waza dev code-explainer --target high --auto
```

Iteratively:

  1. Scores current compliance
  2. Identifies issues
  3. Suggests improvements
  4. Applies changes
  5. Re-scores
  6. Repeats until target reached

## waza grade

Re-grade previous evaluation results without re-executing the agent.

```sh
waza grade <eval.yaml>
```
| Flag | Short | Type | Default | Description |
| --- | --- | --- | --- | --- |
| `--results` | | string | | Required. Path to `waza run` output JSON |
| `--task` | | string | | Grade a specific task ID only |
| `--workspace` | | string | `.` | Agent workspace directory for file-based graders |
| `--judge-model` | | string | | Model for prompt graders (overrides execution model) |
| `--output` | `-o` | string | | Write full EvaluationOutcome JSON |
| `--verbose` | `-v` | bool | `false` | Verbose output |
```sh
# Grade all tasks from a previous run
waza grade eval.yaml --results results.json

# Grade a specific task
waza grade eval.yaml --results results.json --task "basic-function"

# Use a different judge model
waza grade eval.yaml --results results.json --judge-model claude-opus-4.6

# Save graded results for comparison
waza grade eval.yaml --results results.json -o graded.json
```

## waza session

Manage and inspect session event logs.

### waza session list

List session event logs in a directory.

```sh
waza session list [--dir <path>]
```

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| `--dir` | string | `.` | Directory to search for session logs |

### waza session view

Render a session timeline from an NDJSON event log.

```sh
waza session view <session-file>
```

## waza serve

Start the interactive dashboard.

```sh
waza serve
```

| Flag | Description |
| --- | --- |
| `--port` | Port (default: 3000) |
| `--tcp` | TCP address for JSON-RPC (e.g., `:9000`) |
| `--stdio` | Use stdin/stdout for piping |
```sh
waza serve              # http://localhost:3000
waza serve --port 8080  # http://localhost:8080
waza serve --tcp :9000  # JSON-RPC TCP server
```

## Graders

Waza supports multiple grader types for comprehensive evaluation. See the complete Grader Reference for detailed documentation.

| Grader | Purpose |
| --- | --- |
| `code` | Python/JavaScript assertion-based validation |
| `regex` | Pattern matching in output |
| `file` | File existence and content validation |
| `diff` | Workspace file comparison with snapshots |
| `behavior` | Agent behavior constraints (tool calls, tokens, duration) |
| `action_sequence` | Tool call sequence validation with F1 scoring |
| `skill_invocation` | Skill orchestration sequence validation |
| `prompt` | LLM-as-judge evaluation with rubrics |
| `tool_constraint` | Validate tool usage constraints (e.g., required/forbidden tools, argument patterns) |
| `trigger` | Prompt trigger accuracy: detects whether a prompt should activate a skill |

### tool_constraint

Validate agent tool usage constraints during evaluation.

```yaml
graders:
  - type: tool_constraint
    name: check_tools
    config:
      expect_tools:
        - tool: "bash"                  # Required tool call
          command_pattern: "azd\\s+up"  # Optional regex on the command argument
        - tool: "skill"
          skill_pattern: "my-skill"     # Optional regex on the skill argument
        - tool: "edit"
          path_pattern: "\\.go$"        # Optional regex on the path argument
      reject_tools:
        - tool: "bash"                  # Prohibited when args match this pattern
          command_pattern: "rm\\s+-rf"  # Optional regex on the command argument
        - tool: "create_file"           # Always prohibited
```

All config fields are optional. Omitted fields skip that constraint.

### prompt

Use the prompt grader for LLM-as-judge evaluation. In pairwise mode, compare two approaches side by side to reduce position bias.

```yaml
graders:
  - type: prompt
    name: code_quality_judge
    config:
      mode: pairwise  # Enable pairwise comparison (requires --baseline flag)
      rubric: "Compare these solutions for code quality and correctness"
      max_tokens: 500
```

Requirements:

- Pairwise mode requires the `--baseline` flag on `waza run`
- Baseline execution must complete before pairwise comparison runs
- Each task is evaluated twice: once without the skill (baseline) and once with it (treatment)

Example:

```sh
waza run eval.yaml --baseline -o results.json
# Output includes pairwise judge scores comparing baseline vs treatment approaches
```

## waza models

List models available for evaluation via the Copilot SDK.

```sh
waza models [flags]
```

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| `--json` | bool | `false` | Output as JSON |

Displays a table with model ID, name, vision support, and context window size.

```
MODEL ID         NAME             VISION  CONTEXT WINDOW
──────────────────────────────────────────────────────────────
claude-sonnet-4  Claude Sonnet 4  no      200k
gpt-4o           GPT-4o           yes     128k

2 models available
```

Requires Copilot authentication. If not authenticated, you will see:

```
Error: not authenticated — run "copilot login" first
```

```sh
# List all models in table format
waza models

# List models as JSON (for scripting)
waza models --json
```
## Exit Codes

| Code | Meaning |
| --- | --- |
| `0` | Success |
| `1` | One or more tasks failed |
| `2` | Configuration or runtime error |
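For scripting around these exit codes, a small wrapper can distinguish task failures from configuration errors. This is an illustrative sketch: the helper name is ours, not part of waza; only the code meanings come from the table above.

```shell
# classify_waza_exit maps waza's documented exit codes to a short label.
# (Hypothetical helper; the 0/1/2 meanings are from the exit-code table.)
classify_waza_exit() {
  case "$1" in
    0) echo "success" ;;
    1) echo "task-failure" ;;
    2) echo "config-error" ;;
    *) echo "unknown" ;;
  esac
}

# Typical CI usage (waza assumed to be on PATH):
#   waza run eval.yaml -o results.json
#   classify_waza_exit $?
```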
## Global Flags

| Flag | Description |
| --- | --- |
| `--help` | Show help |
| `--version` | Show version |
| `--verbose` | Enable debug output |
| `--no-update-check` | Disable automatic version update check |

## Update Checks

Waza checks for newer versions in the background when you run any command. If an update is available, a one-line notice is printed after the command output:

```
A newer version of waza is available: v0.24.0 → v0.28.0. Run: curl -fsSL ... | bash
```

- The check is non-blocking; it never slows down your command.
- Results are cached for 24 hours in `~/.waza/version-check.json`.
- Disable with `--no-update-check` or set `WAZA_NO_UPDATE_CHECK=1`.
## Environment Variables

| Variable | Description |
| --- | --- |
| `GITHUB_TOKEN` | Token for Copilot SDK execution |
| `WAZA_HOME` | Config directory (default: `~/.waza`) |
| `WAZA_CACHE` | Cache directory (default: `.waza-cache`) |
| `WAZA_NO_UPDATE_CHECK` | Set to `1` to disable automatic version check |
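For example, a CI job might pin these variables before invoking waza. The values below are placeholders; only the variable names come from the table above:

```shell
# Placeholder values; the variable names are the documented ones.
export GITHUB_TOKEN="<your-token>"   # Copilot SDK execution (placeholder value)
export WAZA_HOME="$HOME/.waza"       # config directory
export WAZA_CACHE=".waza-cache"      # result cache directory
export WAZA_NO_UPDATE_CHECK=1        # silence the background update check in CI

# waza run eval.yaml --cache   # invocation shown for context
```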