
# CLI Commands

Complete reference for all waza CLI commands and their options.

## Installation

```sh
curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
waza --version
```

## waza run

Run an evaluation benchmark.

```sh
waza run [eval.yaml | skill-name]
```

| Argument | Description |
| --- | --- |
| `eval.yaml` | Path to an evaluation spec file |
| `skill-name` | Skill name (auto-detects its eval.yaml) |
| (none) | Auto-detect using workspace detection |
| Flag | Short | Type | Default | Description |
| --- | --- | --- | --- | --- |
| `--context-dir` | `-c` | string | `./fixtures` | Fixtures directory path |
| `--output` | `-o` | string | | Save results JSON to file |
| `--output-dir` | `-d` | string | | Save output artifacts to directory |
| `--verbose` | `-v` | bool | `false` | Detailed progress output |
| `--parallel` | | bool | `false` | Run tasks concurrently |
| `--workers` | `-w` | int | `4` | Number of concurrent workers |
| `--trials` | | int | `config.trials_per_task` | Run each task N times for flakiness detection; when provided, must be >= 1 |
| `--task` | `-t` | string | | Filter tasks by name (repeatable) |
| `--tags` | | string | | Filter tasks by tags (repeatable) |
| `--model` | `-m` | string | | Override model (repeatable) |
| `--judge-model` | | string | | Model for LLM-as-judge graders (overrides the execution model) |
| `--cache` | | bool | `false` | Enable result caching |
| `--cache-dir` | | string | `.waza-cache` | Cache directory path |
| `--format` | `-f` | string | `default` | Output format: `default`, `github-comment` |
| `--reporter` | | string[] | | Output reporters: `json`, `junit:<path>` (repeatable) |
| `--timeout` | | int | `300` | Task timeout in seconds |
| `--baseline` | | bool | `false` | A/B testing mode: runs each task twice (without the skill as baseline, with the skill as treatment) and computes improvement scores |
| `--update-snapshots` | | bool | `false` | Update or create diff grader snapshot files to match current workspace output |
| `--discover` | | string | | Auto skill discovery: walks the directory tree for SKILL.md + eval.yaml (root/tests/evals) |
| `--strict` | | bool | `false` | Fail if any SKILL.md lacks eval coverage (use with `--discover`) |
```sh
# Run all tasks
waza run eval.yaml -v

# Run specific skill
waza run code-explainer

# Specify fixtures directory
waza run eval.yaml -c ./fixtures -v

# Save results
waza run eval.yaml -o results.json

# Filter to specific tasks
waza run eval.yaml --task "basic*" --task "edge*"

# Multiple models (parallel)
waza run eval.yaml --model gpt-4o --model claude-sonnet-4.6

# Use a different judge model for LLM-as-judge graders
waza run eval.yaml --model gpt-4o --judge-model claude-opus-4.6

# Parallel execution with 8 workers
waza run eval.yaml --parallel --workers 8

# With caching
waza run eval.yaml --cache --cache-dir .waza-cache

# Generate JUnit XML for CI test reporting
waza run eval.yaml --reporter junit:results.xml

# A/B testing: baseline vs skill performance
waza run eval.yaml --baseline -o results.json
# Output includes improvement breakdown (quality, tokens, turns, time, completion)

# Auto-update diff grader snapshots
waza run eval.yaml --update-snapshots

# Auto skill discovery
waza run --discover ./skills/

# Auto discovery with strict mode (fail if any SKILL.md lacks eval coverage)
waza run --discover --strict ./skills/
```

## waza init

Initialize a new waza project.

```sh
waza init [directory]
```

| Argument | Description |
| --- | --- |
| `directory` | Project directory (default: current dir) |

| Flag | Description |
| --- | --- |
| `--no-skill` | Skip the first-skill creation prompt |

Creates:

```
project-root/
├── skills/
├── evals/
├── .github/workflows/eval.yml
├── .gitignore
└── README.md
```

```sh
waza init my-project
waza init my-project --no-skill
```

## waza new skill

Create a new skill.

```sh
waza new skill [skill-name]
```

Project mode (inside a skills/ directory):

```sh
cd my-skills-repo
waza new skill code-explainer
# Creates skills/code-explainer/SKILL.md + evals/code-explainer/
```

Standalone mode (no skills/ directory):

```sh
waza new skill my-skill
# Creates my-skill/ with all files
```

| Flag | Description |
| --- | --- |
| `--template` | Template pack (coming soon) |
```sh
# Interactive wizard
waza new skill code-analyzer

# Non-interactive (CI/CD)
waza new skill code-analyzer << EOF
Code Analyzer
Analyzes code for patterns and issues
code, analysis
EOF
```

## waza new eval

Scaffold an eval suite for an existing skill, generated from its SKILL.md.

```sh
waza new eval <skill-name>
```

Creates:

- `evals/<skill-name>/eval.yaml`
- `evals/<skill-name>/tasks/positive-trigger-1.yaml`
- `evals/<skill-name>/tasks/positive-trigger-2.yaml`
- `evals/<skill-name>/tasks/negative-trigger-1.yaml`
| Flag | Description |
| --- | --- |
| `--output` | Custom path for the generated eval.yaml; task YAMLs are created in a sibling tasks/ directory next to this file |
```sh
# Generate eval scaffold using skills/my-skill/SKILL.md
waza new eval my-skill

# Write eval.yaml to a custom location
waza new eval my-skill --output evals/custom/my-skill-eval.yaml
```

## waza new task from-prompt

Generate a task YAML file from a recorded prompt execution.

Run your prompt through Copilot and generate a task file with inferred validators from the recorded session.

```sh
waza new task from-prompt <prompt> <task-path>
```

| Argument | Description |
| --- | --- |
| `<prompt>` | Prompt to execute during recording |
| `<task-path>` | Output path for the generated task YAML |

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| `--model` | string | `claude-sonnet-4.5` | Copilot model to use for recording |
| `--testname` | string | `auto-generated-test` | Test name and test ID for the generated task |
| `--tags` | string[] | | Tags to add to the generated task |
| `--timeout` | duration | `5m0s` | Max time to allow prompt completion |
| `--overwrite` | bool | `false` | Overwrite the output file if it already exists |
| `--root` | string | `.` | Directory used to discover skills |
```sh
# Generate a task from a recorded prompt run
waza new task from-prompt "Refactor this function for readability" evals/code-explainer/tasks/refactor-readability.yaml

# Add metadata and allow file replacement
waza new task from-prompt "Explain this diff and risks" evals/code-explainer/tasks/diff-analysis.yaml \
  --testname diff-analysis \
  --tags recorded,regression \
  --overwrite
```

## waza check

Validate skill compliance and readiness.

```sh
waza check [skill-name | skill-path]
```

| Argument | Description |
| --- | --- |
| `skill-name` | Skill name (e.g., code-explainer) |
| `skill-path` | Path to a skill directory |
| (none) | Check all skills in the workspace |

| Flag | Description |
| --- | --- |
| `--verbose` | Detailed compliance report |
| `--format` | Output format: `text` (default), `json` |
Sample output:

```
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: code-explainer
📋 Compliance Score: High
✅ Excellent! Your skill meets all requirements.
📊 Token Budget: 420 / 500 tokens
✅ Within budget.
🧪 Evaluation Suite: Found
✅ eval.yaml detected.
✅ Your skill is ready for submission!
```
```sh
waza check code-explainer
waza check ./skills/code-explainer
waza check --verbose
```

## waza compare

Compare evaluation results across models.

```sh
waza compare [results-1.json] [results-2.json] ...
```

| Argument | Description |
| --- | --- |
| `results-N.json` | Result files to compare (2+ required) |

| Flag | Description |
| --- | --- |
| `--format` | Output format: `table` (default), `json` |
```sh
waza compare gpt4.json sonnet.json
waza compare gpt4.json sonnet.json opus.json
waza compare results-*.json --format json
```

## waza suggest

Generate suggested eval artifacts from a skill’s SKILL.md using an LLM.

```sh
waza suggest <skill-path>
```

| Flag | Description |
| --- | --- |
| `--model` | Model to use for suggestions (default: project default model) |
| `--dry-run` | Print suggestions to stdout (default) |
| `--apply` | Write suggested files to disk |
| `--output-dir` | Output directory (default: `<skill-path>/evals`) |
| `--format` | Output format: `yaml` (default), `json` |
```sh
# Preview suggestions
waza suggest skills/code-explainer --dry-run

# Write eval/task/fixture files
waza suggest skills/code-explainer --apply

# JSON output
waza suggest skills/code-explainer --format json
```

## waza tokens

Token budget management.

### waza tokens count

Count tokens in skill files.

```sh
waza tokens count [path]
```

```sh
waza tokens count skills/code-explainer/SKILL.md
waza tokens count skills/
```

### waza tokens check

Check token usage against budget.

```sh
waza tokens check [skill-name]
```

```sh
waza tokens check code-explainer
# Output:
# code-explainer: 420 / 500 tokens (84%)
# ✅ Within budget
```

Token limits are resolved in priority order: `.waza.yaml` `tokens.limits` → `token-limits.json` (deprecated; migrate to `.waza.yaml`) → built-in defaults. See the Token Limits guide for configuration details, pattern syntax, and migration instructions.

Configure per-file limits in .waza.yaml:

```yaml
tokens:
  limits:
    defaults:
      "*.md": 2000
      "skills/**/SKILL.md": 5000
    overrides:
      "skills/complex-skill/SKILL.md": 7500
```

### waza tokens compare

Compare markdown token counts between git refs. Supports general-purpose file-level comparison and skill-aware comparison with CI gating.

```sh
waza tokens compare [refs...]
```

```sh
# Compare HEAD to working tree (default)
waza tokens compare

# Compare a specific ref to working tree
waza tokens compare main

# Skill-aware comparison with CI threshold
waza tokens compare main --skills --threshold 10

# JSON output with strict absolute limits
waza tokens compare main --skills --threshold 10 --strict --format json
```

Flags: `--format table|json`, `--show-unchanged`, `--strict`, `--skills`, `--threshold <percent>`

Without --skills, compares all markdown files. With --skills, restricts comparison to SKILL.md files under configured skill roots (skills/, .github/skills/, and paths.skills from .waza.yaml). In skills mode the default base ref is origin/main (falling back to main).

--threshold sets a percentage-change gate for CI — newly added files are exempt from threshold checks (no baseline to compare) but still subject to absolute limit checks when --strict is set.

### waza tokens profile

Structural analysis of SKILL.md files: reports token count, section count, code block count, and workflow step detection.

```sh
waza tokens profile [skill-name | path]
```

Flags: --format text|json, --tokenizer bpe|estimate

Example:

```
📊 my-skill: 1,722 tokens (detailed ✓), 8 sections, 4 code blocks
⚠️ no workflow steps detected
```

Warnings: no workflow steps, >2,500 tokens, fewer than 3 sections.
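The warning thresholds above can be restated as a small sketch, using only the limits named in the text (the function itself is hypothetical, not waza's internal check):

```python
def profile_warnings(tokens: int, sections: int,
                     has_workflow_steps: bool) -> list[str]:
    """Thresholds taken from the documented warnings."""
    warnings = []
    if not has_workflow_steps:
        warnings.append("no workflow steps detected")
    if tokens > 2500:
        warnings.append("over 2,500 tokens")
    if sections < 3:
        warnings.append("fewer than 3 sections")
    return warnings

# The example profile above: 1,722 tokens, 8 sections, no workflow steps.
print(profile_warnings(1722, 8, has_workflow_steps=False))
# -> ['no workflow steps detected']
```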

### waza tokens suggest

Get optimization suggestions.

```sh
waza tokens suggest [skill-name]
```

Analyzes SKILL.md and suggests:

- Sections to shorten
- Removable content
- Restructuring opportunities

## waza results

Manage evaluation results from cloud storage or local storage.

### waza results list

List evaluation runs from configured cloud storage.

```sh
waza results list
waza results list --limit 20
waza results list --format json
```

| Flag | Description |
| --- | --- |
| `--limit <n>` | Maximum results to display (default: 10) |
| `--format` | Output format: `table` (default) or `json` |

### waza results compare

Compare two evaluation runs side by side.

```sh
waza results compare run-id-1 run-id-2
waza results compare run-id-1 run-id-2 --format json
```

| Flag | Description |
| --- | --- |
| `--format` | Output format: `table` (default) or `json` |

## waza dev

Improve skill compliance iteratively.

```sh
waza dev [skill-name | skill-path]
```

| Flag | Description |
| --- | --- |
| `--target` | Target level: `low`, `medium`, `high` |
| `--max-iterations` | Max improvement loops (default: 5) |
| `--auto` | Auto-apply without prompting |
| `--fast` | Skip integration tests |
```sh
waza dev code-explainer --target high --auto
```

Iteratively:

  1. Scores current compliance
  2. Identifies issues
  3. Suggests improvements
  4. Applies changes
  5. Re-scores
  6. Repeats until target reached
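The loop above can be sketched with a stub scorer. This is a shape illustration only; waza's real loop uses an LLM to suggest and apply changes, and `dev_loop` is a hypothetical helper:

```python
def dev_loop(score, improve, target: int, max_iterations: int = 5) -> int:
    """Score, improve, and re-score until the target is reached
    or the iteration budget runs out."""
    current = score()
    for _ in range(max_iterations):
        if current >= target:
            break
        improve()          # suggest + apply changes
        current = score()  # re-score
    return current

# Stub: each improvement round raises the compliance score by 20 points.
state = {"score": 30}
final = dev_loop(lambda: state["score"],
                 lambda: state.__setitem__("score", state["score"] + 20),
                 target=90)
print(final)  # 90, reached after three improvement rounds
```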

## waza serve

Start the interactive dashboard.

```sh
waza serve
```

| Flag | Description |
| --- | --- |
| `--port` | Port (default: 3000) |
| `--tcp` | TCP address for JSON-RPC (e.g., `:9000`) |
| `--stdio` | Use stdin/stdout for piping |

```sh
waza serve             # http://localhost:3000
waza serve --port 8080 # http://localhost:8080
waza serve --tcp :9000 # JSON-RPC TCP server
```

## Graders

Waza supports multiple grader types for comprehensive evaluation. See the complete Grader Reference for detailed documentation.

| Grader | Purpose |
| --- | --- |
| `code` | Python/JavaScript assertion-based validation |
| `regex` | Pattern matching in output |
| `file` | File existence and content validation |
| `diff` | Workspace file comparison with snapshots |
| `behavior` | Agent behavior constraints (tool calls, tokens, duration) |
| `action_sequence` | Tool call sequence validation with F1 scoring |
| `skill_invocation` | Skill orchestration sequence validation |
| `prompt` | LLM-as-judge evaluation with rubrics |
| `tool_constraint` | Validate tool usage constraints (e.g., required/forbidden tools, argument patterns) |
| `trigger_tests` | Prompt trigger accuracy detection |
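For intuition on the F1 scoring mentioned for `action_sequence`, here is the standard F1 formula applied to sets of tool calls. This is a generic sketch; the grader's actual matching rules (e.g., whether order or duplicates count) are not specified here and may differ:

```python
def f1_score(expected: list[str], actual: list[str]) -> float:
    """F1 = harmonic mean of precision and recall over tool-call sets."""
    expected_set, actual_set = set(expected), set(actual)
    overlap = expected_set & actual_set
    if not overlap:
        return 0.0
    precision = len(overlap) / len(actual_set)
    recall = len(overlap) / len(expected_set)
    return 2 * precision * recall / (precision + recall)

# Two of three expected calls made, plus one unexpected call.
print(f1_score(["read_file", "edit", "bash"],
               ["read_file", "bash", "create_file"]))  # ~0.667
```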

### tool_constraint

Validate agent tool usage constraints during evaluation.

```yaml
graders:
  - type: tool_constraint
    name: check_tools
    config:
      expect_tools:
        - tool: "bash"                 # Required tool call
          command_pattern: "azd\\s+up" # Optional regex on the command argument
        - tool: "skill"
          skill_pattern: "my-skill"    # Optional regex on the skill argument
        - tool: "edit"
          path_pattern: "\\.go$"       # Optional regex on the path argument
      reject_tools:
        - tool: "bash"                 # Prohibited when args match this pattern
          command_pattern: "rm\\s+-rf" # Optional regex on the command argument
        - tool: "create_file"          # Always prohibited
```

All config fields are optional. Omitted fields skip that constraint.
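The reject semantics can be read off the YAML above: a rule with a pattern prohibits the tool only when the argument matches; a rule without one prohibits the tool outright. A rough sketch, where `violates_reject` is a hypothetical helper and the call/rule shapes are assumptions mirroring the YAML field names:

```python
import re

def violates_reject(call: dict, rules: list[dict]) -> bool:
    """True if a tool call hits any reject rule."""
    for rule in rules:
        if call["tool"] != rule["tool"]:
            continue
        pattern = rule.get("command_pattern")
        if pattern is None:                          # no pattern: always prohibited
            return True
        if re.search(pattern, call.get("command", "")):
            return True
    return False

rules = [
    {"tool": "bash", "command_pattern": r"rm\s+-rf"},
    {"tool": "create_file"},                         # always prohibited
]
print(violates_reject({"tool": "bash", "command": "rm -rf /tmp/x"}, rules))  # True
print(violates_reject({"tool": "bash", "command": "ls"}, rules))             # False
print(violates_reject({"tool": "create_file", "command": ""}, rules))        # True
```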

### prompt (pairwise mode)

Use the prompt grader for LLM-as-judge evaluation. In pairwise mode, compare two approaches side by side to reduce position bias.

```yaml
graders:
  - type: prompt
    name: code_quality_judge
    config:
      mode: pairwise # Enable pairwise comparison (requires --baseline flag)
      rubric: "Compare these solutions for code quality and correctness"
      max_tokens: 500
```

Requirements:

- Pairwise mode requires the `--baseline` flag on `waza run`
- Baseline execution must complete before pairwise comparison runs
- Each task is evaluated twice: once without the skill (baseline) and once with it (treatment)

Example:

```sh
waza run eval.yaml --baseline -o results.json
# Output includes pairwise judge scores comparing baseline vs treatment approaches
```
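One plausible shape for a baseline-vs-treatment improvement breakdown (the dimensions are the ones `--baseline` reports; the field names, score scale, and `improvement` helper are illustrative, not waza's schema):

```python
def improvement(baseline: dict, treatment: dict) -> dict:
    """Positive deltas mean the skill helped: higher quality,
    fewer tokens, fewer turns, less time."""
    return {
        "quality": treatment["quality"] - baseline["quality"],
        "tokens": baseline["tokens"] - treatment["tokens"],
        "turns": baseline["turns"] - treatment["turns"],
        "time_s": baseline["time_s"] - treatment["time_s"],
    }

base = {"quality": 60, "tokens": 5200, "turns": 9, "time_s": 84}
treat = {"quality": 90, "tokens": 3900, "turns": 6, "time_s": 61}
print(improvement(base, treat))
# {'quality': 30, 'tokens': 1300, 'turns': 3, 'time_s': 23}
```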
## Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | Success |
| 1 | One or more tasks failed |
| 2 | Configuration or runtime error |
## Global Flags

| Flag | Description |
| --- | --- |
| `--help` | Show help |
| `--version` | Show version |
| `--verbose` | Enable debug output |
## Environment Variables

| Variable | Description |
| --- | --- |
| `GITHUB_TOKEN` | Token for Copilot SDK execution |
| `WAZA_HOME` | Config directory (default: `~/.waza`) |
| `WAZA_CACHE` | Cache directory (default: `.waza-cache`) |