
# CLI Commands

Complete reference for all waza CLI commands and their options.

## Installation

```sh
curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
waza --version
```

## waza run

Run an evaluation benchmark.

```sh
waza run [eval.yaml | skill-name]
```
| Argument | Description |
| --- | --- |
| `[eval.yaml]` | Path to evaluation spec file |
| `[skill-name]` | Skill name (auto-detects `eval.yaml`) |
| (none) | Auto-detect using workspace detection |
| Flag | Short | Type | Default | Description |
| --- | --- | --- | --- | --- |
| `--context-dir` | `-c` | string | `./fixtures` | Fixtures directory path |
| `--output` | `-o` | string | | Save results JSON to file |
| `--output-dir` | `-d` | string | | Directory for structured output; each run creates a UTC-timestamped subdirectory of the specified directory. Mutually exclusive with `--output`. |
| `--verbose` | `-v` | bool | `false` | Detailed progress output |
| `--parallel` | | bool | `false` | Run tasks concurrently |
| `--workers` | `-w` | int | `4` | Number of concurrent workers |
| `--trials` | | int | `config.trials_per_task` | Run each task N times for flakiness detection (omit to use `config.trials_per_task`; when provided, value must be >= 1) |
| `--task` | `-t` | string | | Filter tasks by name (repeatable) |
| `--tags` | | string | | Filter tasks by tags (repeatable) |
| `--model` | `-m` | string | | Override model (repeatable) |
| `--judge-model` | | string | | Model for LLM-as-judge graders (overrides execution model) |
| `--cache` | | bool | `false` | Enable result caching |
| `--cache-dir` | | string | `.waza-cache` | Cache directory path |
| `--format` | `-f` | string | `default` | Output format: `default`, `github-comment` |
| `--reporter` | | string[] | | Output reporters: `json`, `junit:<path>` (repeatable) |
| `--timeout` | | int | `300` | Task timeout in seconds |
| `--baseline` | | bool | `false` | A/B testing mode: runs each task twice (without skill = baseline, with skill = normal) and computes improvement scores |
| `--update-snapshots` | | bool | `false` | Update or create diff grader snapshot files to match current workspace output |
| `--discover` | | string | | Auto skill discovery: walks directory tree for SKILL.md + eval.yaml (root/tests/evals) |
| `--strict` | | bool | `false` | Fail if any SKILL.md lacks eval coverage (use with `--discover`) |
| `--suggest` | | bool | `false` | Generate a Copilot suggestion report based on test outcomes |
| `--recommend` | | bool | `false` | Generate a heuristic recommendation after a multi-model run |
| `--session-log` | | bool | `false` | Enable session event logging (NDJSON) |
| `--session-dir` | | string | `.` | Directory for session log files |
| `--no-summary` | | bool | `false` | Skip combined summary.json for multi-skill runs |
| `--skip-graders` | | bool | `false` | Skip grading (execution only); grade later with `waza grade` |
| `--no-skills` | | bool | `false` | Disable all skill loading for the evaluation |
| `--transcript-dir` | | string | | Save per-task transcript JSON files |
| `--no-cache` | | bool | `false` | Explicitly disable result caching |
| `--keep-workspace` | | bool | `false` | Preserve temp workspaces after execution for debugging |
```sh
# Run all tasks
waza run eval.yaml -v

# Run specific skill
waza run code-explainer

# Specify fixtures directory
waza run eval.yaml -c ./fixtures -v

# Save results
waza run eval.yaml -o results.json

# Filter to specific tasks
waza run eval.yaml --task "basic*" --task "edge*"

# Multiple models (parallel)
waza run eval.yaml --model gpt-4o --model claude-sonnet-4.6

# Use a different judge model for LLM-as-judge graders
waza run eval.yaml --model gpt-4o --judge-model claude-opus-4.6

# Parallel execution with 8 workers
waza run eval.yaml --parallel --workers 8

# With caching
waza run eval.yaml --cache --cache-dir .waza-cache

# Generate JUnit XML for CI test reporting
waza run eval.yaml --reporter junit:results.xml

# A/B testing: baseline vs skill performance
waza run eval.yaml --baseline -o results.json
# Output includes improvement breakdown (quality, tokens, turns, time, completion)

# Auto-update diff grader snapshots
waza run eval.yaml --update-snapshots

# Auto skill discovery
waza run --discover ./skills/

# Auto discovery with strict mode (fail if any SKILL.md lacks eval coverage)
waza run --discover --strict ./skills/

# Skip grading, then grade separately
waza run eval.yaml --skip-graders -o results.json
waza grade eval.yaml --results results.json

# Session event logging
waza run eval.yaml --session-log --session-dir ./logs

# Multi-model comparison with recommendation
waza run eval.yaml --model gpt-4o --model claude-sonnet-4.6 --recommend

# Keep temp workspaces for debugging fixture issues
waza run eval.yaml --keep-workspace -v
```

## waza init

Initialize a new waza project.

```sh
waza init [directory]
```
| Argument | Description |
| --- | --- |
| `[directory]` | Project directory (default: current dir) |

| Flag | Description |
| --- | --- |
| `--no-skill` | Skip first skill creation prompt |
Generated structure:

```
project-root/
├── skills/
├── evals/
├── .github/workflows/eval.yml
├── .gitignore
└── README.md
```
```sh
waza init my-project
waza init my-project --no-skill
```

## waza new skill

Create a new skill.

```sh
waza new skill [skill-name]
```

Project mode (inside a skills/ directory):

```sh
cd my-skills-repo
waza new skill code-explainer
# Creates skills/code-explainer/SKILL.md + evals/code-explainer/
```

Standalone mode (no skills/ directory):

```sh
waza new skill my-skill
# Creates my-skill/ with all files
```

| Flag | Description |
| --- | --- |
| `--template` | Template pack (coming soon) |
```sh
# Interactive wizard
waza new skill code-analyzer

# Non-interactive (CI/CD)
waza new skill code-analyzer << EOF
Code Analyzer
Analyzes code for patterns and issues
code, analysis
EOF
```

## waza new eval

Scaffold an eval suite for an existing skill, generated from its SKILL.md.

```sh
waza new eval <skill-name>
```

Creates:

- `evals/<skill-name>/eval.yaml`
- `evals/<skill-name>/tasks/positive-trigger-1.yaml`
- `evals/<skill-name>/tasks/positive-trigger-2.yaml`
- `evals/<skill-name>/tasks/negative-trigger-1.yaml`
| Flag | Description |
| --- | --- |
| `--output` | Custom path for generated `eval.yaml`; task YAMLs are created in a sibling `tasks/` directory next to this file. |
```sh
# Generate eval scaffold using skills/my-skill/SKILL.md
waza new eval my-skill

# Write eval.yaml to a custom location
waza new eval my-skill --output evals/custom/my-skill-eval.yaml
```

## waza new task from-prompt

Generate task YAML files from a recorded prompt execution.

Run your prompt through Copilot and generate a task file with inferred validators from the recorded session.

```sh
waza new task from-prompt <prompt> <task-path>
```
| Argument | Description |
| --- | --- |
| `<prompt>` | Prompt to execute during recording |
| `<task-path>` | Output path for the generated task YAML |
| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| `--model` | string | `claude-sonnet-4.5` | Copilot model to use for recording |
| `--testname` | string | `auto-generated-test` | Test name and test ID for generated task |
| `--tags` | string[] | | Tags to add to generated task |
| `--timeout` | duration | `5m0s` | Max time to allow prompt completion |
| `--overwrite` | bool | `false` | Overwrite output file if it already exists |
| `--root` | string | `.` | Directory used to discover skills |
```sh
# Generate a task from a recorded prompt run
waza new task from-prompt "Refactor this function for readability" evals/code-explainer/tasks/refactor-readability.yaml

# Add metadata and allow file replacement
waza new task from-prompt "Explain this diff and risks" evals/code-explainer/tasks/diff-analysis.yaml \
  --testname diff-analysis \
  --tags recorded,regression \
  --overwrite
```

## waza check

Validate skill compliance and readiness.

```sh
waza check [skill-name | skill-path]
```
| Argument | Description |
| --- | --- |
| `[skill-name]` | Skill name (e.g., `code-explainer`) |
| `[skill-path]` | Path to skill directory |
| (none) | Check all skills in workspace |
| Flag | Description |
| --- | --- |
| `--verbose` | Detailed compliance report |
| `--format` | Output format: `text` (default), `json` |
Example output:

```
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: code-explainer
📋 Compliance Score: High
✅ Excellent! Your skill meets all requirements.
📊 Token Budget: 420 / 500 tokens
✅ Within budget.
🧪 Evaluation Suite: Found
✅ eval.yaml detected.
✅ Your skill is ready for submission!
```
```sh
waza check code-explainer
waza check ./skills/code-explainer
waza check --verbose
```

## waza compare

Compare evaluation results across models.

```sh
waza compare [results-1.json] [results-2.json] ...
```
| Argument | Description |
| --- | --- |
| `[results-N.json]` | Result files to compare (2+ required) |

| Flag | Description |
| --- | --- |
| `--format` | Output format: `table` (default), `json` |
```sh
waza compare gpt4.json sonnet.json
waza compare gpt4.json sonnet.json opus.json
waza compare results-*.json --format json
```

## waza coverage

Generate an eval coverage grid for discovered skills.

```sh
waza coverage [root]
```
| Argument | Description |
| --- | --- |
| `[root]` | Root directory to scan (default: current directory) |

| Flag | Description |
| --- | --- |
| `-f, --format` | Output format: `text` (default), `markdown`, `json` |
| `--path` | Additional directories to scan for skills/evals (repeatable) |
Coverage levels:

- Full: Skill has an `eval.yaml`/`eval.yml` with tasks (via `tasks:` or `tasks_from:`) and at least 2 distinct grader types.
- Partial: Skill has an `eval.yaml`/`eval.yml` but fewer than 2 grader types or no tasks.
- Missing: No `eval.yaml`/`eval.yml` found for the skill.

Note: The reported coverage percentage reflects only fully covered skills (Fully Covered / Total Skills).

```sh
waza coverage
waza coverage --format markdown
waza coverage --format json
waza coverage --path custom-evals --path plugins
```

## waza suggest

Generate suggested eval artifacts from a skill’s SKILL.md using an LLM.

```sh
waza suggest <skill-path>
```
| Flag | Description |
| --- | --- |
| `--model` | Model to use for suggestions (default: project default model) |
| `--dry-run` | Print suggestions to stdout (default) |
| `--apply` | Write suggested files to disk |
| `--output-dir` | Output directory (default: `<skill-path>/evals`) |
| `--format` | Output format: `yaml` (default), `json` |
```sh
# Preview suggestions
waza suggest skills/code-explainer --dry-run

# Write eval/task/fixture files
waza suggest skills/code-explainer --apply

# JSON output
waza suggest skills/code-explainer --format json
```

## waza tokens

Token budget management.

### waza tokens count

Count tokens in skill files.

```sh
waza tokens count [path]
```

```sh
waza tokens count skills/code-explainer/SKILL.md
waza tokens count skills/
```

### waza tokens check

Check token usage against budget.

```sh
waza tokens check [skill-name]
```

```sh
waza tokens check code-explainer
# Output:
# code-explainer: 420 / 500 tokens (84%)
# ✅ Within budget
```

Token limits are resolved in priority order: `.waza.yaml` `tokens.limits` → `token-limits.json` (deprecated; migrate to `.waza.yaml`) → built-in defaults. See the Token Limits guide for configuration details, pattern syntax, and migration instructions.

Configure per-file limits in `.waza.yaml`:

```yaml
tokens:
  limits:
    defaults:
      "*.md": 2000
      "skills/**/SKILL.md": 5000
    overrides:
      "skills/complex-skill/SKILL.md": 7500
```

### waza tokens compare

Compare markdown token counts between git refs. Supports both general-purpose file-level comparison and skill-aware comparison with CI gating.

```sh
waza tokens compare [refs...]
```

```sh
# Compare HEAD to working tree (default)
waza tokens compare

# Compare a specific ref to working tree
waza tokens compare main

# Skill-aware comparison with CI threshold
waza tokens compare main --skills --threshold 10

# JSON output with strict absolute limits
waza tokens compare main --skills --threshold 10 --strict --format json
```

Flags: `--format table|json`, `--show-unchanged`, `--strict`, `--skills`, `--threshold <percent>`

Without `--skills`, the command compares all markdown files. With `--skills`, it restricts comparison to SKILL.md files under configured skill roots (`skills/`, `.github/skills/`, and `paths.skills` from `.waza.yaml`). In skills mode the default base ref is `origin/main` (falling back to `main`).

`--threshold` sets a percentage-change gate for CI. Newly added files are exempt from threshold checks (there is no baseline to compare) but are still subject to absolute limit checks when `--strict` is set.
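The threshold gate can back a CI job. The fragment below is a hypothetical GitHub Actions step (step name and fetch command are illustrative, not part of waza); it assumes the command exits nonzero when the threshold or a strict limit is violated, which fails the build:

```yaml
# Hypothetical CI step: fail the build when SKILL.md token growth exceeds 10%.
# Assumes waza is already installed on the runner.
- name: Token budget gate
  run: |
    git fetch origin main
    waza tokens compare main --skills --threshold 10 --strict --format json
```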

### waza tokens profile

Structural analysis of SKILL.md files: reports token count, section count, code block count, and workflow step detection.

```sh
waza tokens profile [skill-name | path]
```

Flags: `--format text|json`, `--tokenizer bpe|estimate`

Example output:

```
📊 my-skill: 1,722 tokens (detailed ✓), 8 sections, 4 code blocks
⚠️ no workflow steps detected
```

Warnings are emitted when no workflow steps are detected, the file exceeds 2,500 tokens, or it has fewer than 3 sections.

### waza tokens suggest

Get optimization suggestions.

```sh
waza tokens suggest [skill-name]
```

Analyzes SKILL.md and suggests:

- Sections to shorten
- Removable content
- Restructuring opportunities

## waza results

Manage evaluation results from cloud storage or local storage.

### waza results list

List evaluation runs from configured cloud storage.

```sh
waza results list
waza results list --limit 20
waza results list --format json
```

| Flag | Description |
| --- | --- |
| `--limit <n>` | Maximum results to display (default: 10) |
| `--format` | Output format: `table` or `json` (default: `table`) |

### waza results compare

Compare two evaluation runs side by side.

```sh
waza results compare run-id-1 run-id-2
waza results compare run-id-1 run-id-2 --format json
```

| Flag | Description |
| --- | --- |
| `--format` | Output format: `table` or `json` (default: `table`) |

## waza cache

Manage the evaluation result cache.

```sh
waza cache clear [--cache-dir=.waza-cache]
```

Clear all cached evaluation results.

The cache stores test outcomes to speed up repeated evaluations with the same inputs. Cached results are keyed by spec configuration, task definition, model, and fixture file contents.

```sh
waza cache clear
waza cache clear --cache-dir /path/to/cache
```

| Flag | Description |
| --- | --- |
| `--cache-dir` | Cache directory to clear (default: `.waza-cache`) |

```sh
# Clear default cache
waza cache clear

# Clear custom cache directory
waza cache clear --cache-dir .my-cache
```

## waza dev

Improve skill compliance iteratively.

```sh
waza dev [skill-name | skill-path]
```

| Flag | Description |
| --- | --- |
| `--target` | Target level: `low`, `medium`, `high` |
| `--max-iterations` | Max improvement loops (default: 5) |
| `--auto` | Auto-apply without prompting |
| `--fast` | Skip integration tests |
```sh
waza dev code-explainer --target high --auto
```

Iteratively:

  1. Scores current compliance
  2. Identifies issues
  3. Suggests improvements
  4. Applies changes
  5. Re-scores
  6. Repeats until target reached

## waza grade

Re-grade previous evaluation results without re-executing the agent.

```sh
waza grade <eval.yaml>
```
| Flag | Short | Type | Default | Description |
| --- | --- | --- | --- | --- |
| `--results` | | string | | Required. Path to `waza run` output JSON |
| `--task` | | string | | Grade a specific task ID only |
| `--workspace` | | string | `.` | Agent workspace directory for file-based graders |
| `--judge-model` | | string | | Model for prompt graders (overrides execution model) |
| `--output` | `-o` | string | | Write full EvaluationOutcome JSON |
| `--verbose` | `-v` | bool | `false` | Verbose output |
```sh
# Grade all tasks from a previous run
waza grade eval.yaml --results results.json

# Grade a specific task
waza grade eval.yaml --results results.json --task "basic-function"

# Use a different judge model
waza grade eval.yaml --results results.json --judge-model claude-opus-4.6

# Save graded results for comparison
waza grade eval.yaml --results results.json -o graded.json
```

## waza session

Manage and inspect session event logs.

### waza session list

List session event logs in a directory.

```sh
waza session list [--dir <path>]
```

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| `--dir` | string | `.` | Directory to search for session logs |

### waza session view

Render a session timeline from an NDJSON event log.

```sh
waza session view <session-file>
```

## waza serve

Start the interactive dashboard.

```sh
waza serve
```

| Flag | Description |
| --- | --- |
| `--port` | Port (default: 3000) |
| `--tcp` | TCP address for JSON-RPC (e.g., `:9000`) |
| `--stdio` | Use stdin/stdout for piping |
```sh
waza serve              # http://localhost:3000
waza serve --port 8080  # http://localhost:8080
waza serve --tcp :9000  # JSON-RPC TCP server
```

## Graders

Waza supports multiple grader types for comprehensive evaluation. See the complete Grader Reference for detailed documentation.

| Grader | Purpose |
| --- | --- |
| `code` | Python/JavaScript assertion-based validation |
| `regex` | Pattern matching in output |
| `file` | File existence and content validation |
| `diff` | Workspace file comparison with snapshots |
| `behavior` | Agent behavior constraints (tool calls, tokens, duration) |
| `action_sequence` | Tool call sequence validation with F1 scoring |
| `skill_invocation` | Skill orchestration sequence validation |
| `prompt` | LLM-as-judge evaluation with rubrics |
| `tool_constraint` | Validate tool usage constraints (e.g., required/forbidden tools, argument patterns) |
| `trigger` | Prompt trigger accuracy: detects whether a prompt should activate a skill |

### tool_constraint

Validate agent tool usage constraints during evaluation.

```yaml
graders:
  - type: tool_constraint
    name: check_tools
    config:
      expect_tools:
        - tool: "bash"                  # Required tool call
          command_pattern: "azd\\s+up"  # Optional regex on the command argument
        - tool: "skill"
          skill_pattern: "my-skill"     # Optional regex on the skill argument
        - tool: "edit"
          path_pattern: "\\.go$"        # Optional regex on the path argument
      reject_tools:
        - tool: "bash"                  # Prohibited when args match this pattern
          command_pattern: "rm\\s+-rf"  # Optional regex on the command argument
        - tool: "create_file"           # Always prohibited
```

All config fields are optional. Omitted fields skip that constraint.

### prompt

Use the prompt grader for LLM-as-judge evaluation. In pairwise mode, compare two approaches side by side to reduce position bias.

```yaml
graders:
  - type: prompt
    name: code_quality_judge
    config:
      mode: pairwise  # Enable pairwise comparison (requires --baseline flag)
      rubric: "Compare these solutions for code quality and correctness"
      max_tokens: 500
```

Requirements:

- Pairwise mode requires the `--baseline` flag on `waza run`
- Baseline execution must complete before pairwise comparison runs
- Each task is evaluated twice: once without the skill (baseline) and once with it (treatment)

Example:

```sh
waza run eval.yaml --baseline -o results.json
# Output includes pairwise judge scores comparing baseline vs treatment approaches
```

## waza models

List models available for evaluation via the Copilot SDK.

```sh
waza models [flags]
```

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| `--json` | bool | `false` | Output as JSON |

Displays a table with model ID, name, vision support, and context window size.

```
MODEL ID         NAME             VISION  CONTEXT WINDOW
──────────────────────────────────────────────────────────────
claude-sonnet-4  Claude Sonnet 4  no      200k
gpt-4o           GPT-4o           yes     128k

2 models available
```

Requires Copilot authentication. If not authenticated, you will see:

```
Error: not authenticated — run "copilot login" first
```

```sh
# List all models in table format
waza models

# List models as JSON (for scripting)
waza models --json
```
## Exit Codes

| Code | Meaning |
| --- | --- |
| `0` | Success |
| `1` | One or more tasks failed |
| `2` | Configuration or runtime error |
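For scripting around these exit codes, a small wrapper can distinguish task failures from configuration errors. This is an illustrative sketch: the helper name is ours, not part of waza; only the code meanings come from the table above.

```shell
# classify_waza_exit maps waza's documented exit codes to a short label.
# (Hypothetical helper; the 0/1/2 meanings are from the exit-code table.)
classify_waza_exit() {
  case "$1" in
    0) echo "success" ;;
    1) echo "task-failure" ;;
    2) echo "config-error" ;;
    *) echo "unknown" ;;
  esac
}

# Typical CI usage (waza assumed to be on PATH):
#   waza run eval.yaml -o results.json
#   classify_waza_exit $?
```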
## Global Flags

| Flag | Description |
| --- | --- |
| `--help` | Show help |
| `--version` | Show version |
| `--verbose` | Enable debug output |
| `--no-update-check` | Disable automatic version update check |

## Update Checks

Waza checks for newer versions in the background when you run any command. If an update is available, a one-line notice is printed after the command output:

```
A newer version of waza is available: v0.24.0 → v0.28.0. Run: curl -fsSL ... | bash
```

- The check is non-blocking; it never slows down your command.
- Results are cached for 24 hours in `~/.waza/version-check.json`.
- Disable with `--no-update-check` or set `WAZA_NO_UPDATE_CHECK=1`.
## Environment Variables

| Variable | Description |
| --- | --- |
| `GITHUB_TOKEN` | Token for Copilot SDK execution |
| `WAZA_HOME` | Config directory (default: `~/.waza`) |
| `WAZA_CACHE` | Cache directory (default: `.waza-cache`) |
| `WAZA_NO_UPDATE_CHECK` | Set to `1` to disable automatic version check |
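For example, a CI job might pin these variables before invoking waza. The values below are placeholders; only the variable names come from the table above:

```shell
# Placeholder values; the variable names are the documented ones.
export GITHUB_TOKEN="<your-token>"   # Copilot SDK execution (placeholder value)
export WAZA_HOME="$HOME/.waza"       # config directory
export WAZA_CACHE=".waza-cache"      # result cache directory
export WAZA_NO_UPDATE_CHECK=1        # silence the background update check in CI

# waza run eval.yaml --cache   # invocation shown for context
```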