
# CLI Commands

Complete reference for all waza CLI commands and their options.

## Installation

```sh
curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash
waza --version
```

## waza run

Run an evaluation benchmark.

```sh
waza run [eval.yaml | skill-name]
```

| Argument | Description |
| --- | --- |
| `eval.yaml` | Path to an evaluation spec file |
| `skill-name` | Skill name (auto-detects its eval.yaml) |
| (none) | Auto-detect using workspace detection |
| Flag | Short | Type | Default | Description |
| --- | --- | --- | --- | --- |
| `--context-dir` | `-c` | string | `./fixtures` | Fixtures directory path |
| `--output` | `-o` | string | | Save results JSON to file |
| `--output-dir` | `-d` | string | | Save output artifacts to directory |
| `--verbose` | `-v` | bool | `false` | Detailed progress output |
| `--parallel` | | bool | `false` | Run tasks concurrently |
| `--workers` | `-w` | int | `4` | Number of concurrent workers |
| `--trials` | | int | `config.trials_per_task` | Run each task N times for flakiness detection; when provided, must be >= 1 |
| `--task` | `-t` | string | | Filter tasks by name (repeatable) |
| `--tags` | | string | | Filter tasks by tags (repeatable) |
| `--model` | `-m` | string | | Override model (repeatable) |
| `--judge-model` | | string | | Model for LLM-as-judge graders (overrides the execution model) |
| `--cache` | | bool | `false` | Enable result caching |
| `--cache-dir` | | string | `.waza-cache` | Cache directory path |
| `--format` | `-f` | string | `default` | Output format: `default`, `github-comment` |
| `--reporter` | | string[] | | Output reporters: `json`, `junit:<path>` (repeatable) |
| `--timeout` | | int | `300` | Task timeout in seconds |
| `--baseline` | | bool | `false` | A/B testing mode: runs each task twice (without the skill as baseline, with the skill as treatment) and computes improvement scores |
| `--update-snapshots` | | bool | `false` | Update or create diff grader snapshot files to match current workspace output |
| `--discover` | | string | | Auto skill discovery: walks the directory tree for SKILL.md + eval.yaml (root/tests/evals) |
| `--strict` | | bool | `false` | Fail if any SKILL.md lacks eval coverage (use with `--discover`) |
```sh
# Run all tasks
waza run eval.yaml -v

# Run specific skill
waza run code-explainer

# Specify fixtures directory
waza run eval.yaml -c ./fixtures -v

# Save results
waza run eval.yaml -o results.json

# Filter to specific tasks
waza run eval.yaml --task "basic*" --task "edge*"

# Multiple models (parallel)
waza run eval.yaml --model gpt-4o --model claude-sonnet-4.6

# Use a different judge model for LLM-as-judge graders
waza run eval.yaml --model gpt-4o --judge-model claude-opus-4.6

# Parallel execution with 8 workers
waza run eval.yaml --parallel --workers 8

# With caching
waza run eval.yaml --cache --cache-dir .waza-cache

# Generate JUnit XML for CI test reporting
waza run eval.yaml --reporter junit:results.xml

# A/B testing: baseline vs skill performance
waza run eval.yaml --baseline -o results.json
# Output includes improvement breakdown (quality, tokens, turns, time, completion)

# Auto-update diff grader snapshots
waza run eval.yaml --update-snapshots

# Auto skill discovery
waza run --discover ./skills/

# Auto discovery with strict mode (fail if any SKILL.md lacks eval coverage)
waza run --discover --strict ./skills/
```

## waza init

Initialize a new waza project.

```sh
waza init [directory]
```

| Argument | Description |
| --- | --- |
| `directory` | Project directory (default: current dir) |

| Flag | Description |
| --- | --- |
| `--no-skill` | Skip the first-skill creation prompt |

Creates:

```
project-root/
├── skills/
├── evals/
├── .github/workflows/eval.yml
├── .gitignore
└── README.md
```

```sh
waza init my-project
waza init my-project --no-skill
```

## waza new skill

Create a new skill.

```sh
waza new skill [skill-name]
```

Project mode (inside a skills/ directory):

```sh
cd my-skills-repo
waza new skill code-explainer
# Creates skills/code-explainer/SKILL.md + evals/code-explainer/
```

Standalone mode (no skills/ directory):

```sh
waza new skill my-skill
# Creates my-skill/ with all files
```

| Flag | Description |
| --- | --- |
| `--template` | Template pack (coming soon) |
```sh
# Interactive wizard
waza new skill code-analyzer

# Non-interactive (CI/CD)
waza new skill code-analyzer << EOF
Code Analyzer
Analyzes code for patterns and issues
code, analysis
EOF
```

## waza new eval

Scaffold an eval suite for an existing skill, generated from its SKILL.md.

```sh
waza new eval <skill-name>
```

Creates:

- `evals/<skill-name>/eval.yaml`
- `evals/<skill-name>/tasks/positive-trigger-1.yaml`
- `evals/<skill-name>/tasks/positive-trigger-2.yaml`
- `evals/<skill-name>/tasks/negative-trigger-1.yaml`
| Flag | Description |
| --- | --- |
| `--output` | Custom path for the generated eval.yaml; task YAMLs are created in a sibling tasks/ directory next to this file |
```sh
# Generate eval scaffold using skills/my-skill/SKILL.md
waza new eval my-skill

# Write eval.yaml to a custom location
waza new eval my-skill --output evals/custom/my-skill-eval.yaml
```

## waza new task from-prompt

Generate a task YAML file from a recorded prompt execution.

Run your prompt through Copilot and generate a task file with inferred validators from the recorded session.

```sh
waza new task from-prompt <prompt> <task-path>
```

| Argument | Description |
| --- | --- |
| `<prompt>` | Prompt to execute during recording |
| `<task-path>` | Output path for the generated task YAML |

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| `--model` | string | `claude-sonnet-4.5` | Copilot model to use for recording |
| `--testname` | string | `auto-generated-test` | Test name and test ID for the generated task |
| `--tags` | string[] | | Tags to add to the generated task |
| `--timeout` | duration | `5m0s` | Max time to allow prompt completion |
| `--overwrite` | bool | `false` | Overwrite the output file if it already exists |
| `--root` | string | `.` | Directory used to discover skills |
```sh
# Generate a task from a recorded prompt run
waza new task from-prompt "Refactor this function for readability" evals/code-explainer/tasks/refactor-readability.yaml

# Add metadata and allow file replacement
waza new task from-prompt "Explain this diff and risks" evals/code-explainer/tasks/diff-analysis.yaml \
  --testname diff-analysis \
  --tags recorded,regression \
  --overwrite
```

## waza check

Validate skill compliance and readiness.

```sh
waza check [skill-name | skill-path]
```

| Argument | Description |
| --- | --- |
| `skill-name` | Skill name (e.g., code-explainer) |
| `skill-path` | Path to a skill directory |
| (none) | Check all skills in the workspace |

| Flag | Description |
| --- | --- |
| `--verbose` | Detailed compliance report |
| `--format` | Output format: `text` (default), `json` |
Sample output:

```
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: code-explainer
📋 Compliance Score: High
✅ Excellent! Your skill meets all requirements.
📊 Token Budget: 420 / 500 tokens
✅ Within budget.
🧪 Evaluation Suite: Found
✅ eval.yaml detected.
✅ Your skill is ready for submission!
```
```sh
waza check code-explainer
waza check ./skills/code-explainer
waza check --verbose
```

## waza compare

Compare evaluation results across models.

```sh
waza compare [results-1.json] [results-2.json] ...
```

| Argument | Description |
| --- | --- |
| `results-N.json` | Result files to compare (2+ required) |

| Flag | Description |
| --- | --- |
| `--format` | Output format: `table` (default), `json` |
```sh
waza compare gpt4.json sonnet.json
waza compare gpt4.json sonnet.json opus.json
waza compare results-*.json --format json
```

## waza suggest

Generate suggested eval artifacts from a skill’s SKILL.md using an LLM.

```sh
waza suggest <skill-path>
```

| Flag | Description |
| --- | --- |
| `--model` | Model to use for suggestions (default: project default model) |
| `--dry-run` | Print suggestions to stdout (default) |
| `--apply` | Write suggested files to disk |
| `--output-dir` | Output directory (default: `<skill-path>/evals`) |
| `--format` | Output format: `yaml` (default), `json` |
```sh
# Preview suggestions
waza suggest skills/code-explainer --dry-run

# Write eval/task/fixture files
waza suggest skills/code-explainer --apply

# JSON output
waza suggest skills/code-explainer --format json
```

## waza tokens

Token budget management.

### waza tokens count

Count tokens in skill files.

```sh
waza tokens count [path]
```

```sh
waza tokens count skills/code-explainer/SKILL.md
waza tokens count skills/
```

### waza tokens check

Check token usage against budget.

```sh
waza tokens check [skill-name]
```

```sh
waza tokens check code-explainer
# Output:
# code-explainer: 420 / 500 tokens (84%)
# ✅ Within budget
```

Token limits are resolved in priority order: `.waza.yaml` `tokens.limits` → `token-limits.json` (deprecated; migrate to `.waza.yaml`) → built-in defaults. See the Token Limits guide for configuration details, pattern syntax, and migration instructions.

Configure per-file limits in .waza.yaml:

```yaml
tokens:
  limits:
    defaults:
      "*.md": 2000
      "skills/**/SKILL.md": 5000
    overrides:
      "skills/complex-skill/SKILL.md": 7500
```

### waza tokens compare

Compare markdown token counts between git refs. Supports general-purpose file-level comparison and skill-aware comparison with CI gating.

```sh
waza tokens compare [refs...]
```

```sh
# Compare HEAD to working tree (default)
waza tokens compare

# Compare a specific ref to working tree
waza tokens compare main

# Skill-aware comparison with CI threshold
waza tokens compare main --skills --threshold 10

# JSON output with strict absolute limits
waza tokens compare main --skills --threshold 10 --strict --format json
```

Flags: `--format table|json`, `--show-unchanged`, `--strict`, `--skills`, `--threshold <percent>`

Without --skills, compares all markdown files. With --skills, restricts comparison to SKILL.md files under configured skill roots (skills/, .github/skills/, and paths.skills from .waza.yaml). In skills mode the default base ref is origin/main (falling back to main).

--threshold sets a percentage-change gate for CI — newly added files are exempt from threshold checks (no baseline to compare) but still subject to absolute limit checks when --strict is set.

### waza tokens profile

Structural analysis of SKILL.md files: reports token count, section count, code block count, and workflow step detection.

```sh
waza tokens profile [skill-name | path]
```

Flags: --format text|json, --tokenizer bpe|estimate

Example:

```
📊 my-skill: 1,722 tokens (detailed ✓), 8 sections, 4 code blocks
⚠️ no workflow steps detected
```

Warnings: no workflow steps, >2,500 tokens, fewer than 3 sections.
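The warning thresholds above can be restated as a small sketch, using only the limits named in the text (the function itself is hypothetical, not waza's internal check):

```python
def profile_warnings(tokens: int, sections: int,
                     has_workflow_steps: bool) -> list[str]:
    """Thresholds taken from the documented warnings."""
    warnings = []
    if not has_workflow_steps:
        warnings.append("no workflow steps detected")
    if tokens > 2500:
        warnings.append("over 2,500 tokens")
    if sections < 3:
        warnings.append("fewer than 3 sections")
    return warnings

# The example profile above: 1,722 tokens, 8 sections, no workflow steps.
print(profile_warnings(1722, 8, has_workflow_steps=False))
# -> ['no workflow steps detected']
```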

### waza tokens suggest

Get optimization suggestions.

```sh
waza tokens suggest [skill-name]
```

Analyzes SKILL.md and suggests:

- Sections to shorten
- Removable content
- Restructuring opportunities

## waza results

Manage evaluation results from cloud storage or local storage.

### waza results list

List evaluation runs from configured cloud storage.

```sh
waza results list
waza results list --limit 20
waza results list --format json
```

| Flag | Description |
| --- | --- |
| `--limit <n>` | Maximum results to display (default: 10) |
| `--format` | Output format: `table` (default) or `json` |

### waza results compare

Compare two evaluation runs side by side.

```sh
waza results compare run-id-1 run-id-2
waza results compare run-id-1 run-id-2 --format json
```

| Flag | Description |
| --- | --- |
| `--format` | Output format: `table` (default) or `json` |

## waza dev

Improve skill compliance iteratively.

```sh
waza dev [skill-name | skill-path]
```

| Flag | Description |
| --- | --- |
| `--target` | Target level: `low`, `medium`, `high` |
| `--max-iterations` | Max improvement loops (default: 5) |
| `--auto` | Auto-apply without prompting |
| `--fast` | Skip integration tests |
```sh
waza dev code-explainer --target high --auto
```

Iteratively:

  1. Scores current compliance
  2. Identifies issues
  3. Suggests improvements
  4. Applies changes
  5. Re-scores
  6. Repeats until target reached
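The loop above can be sketched with a stub scorer. This is a shape illustration only; waza's real loop uses an LLM to suggest and apply changes, and `dev_loop` is a hypothetical helper:

```python
def dev_loop(score, improve, target: int, max_iterations: int = 5) -> int:
    """Score, improve, and re-score until the target is reached
    or the iteration budget runs out."""
    current = score()
    for _ in range(max_iterations):
        if current >= target:
            break
        improve()          # suggest + apply changes
        current = score()  # re-score
    return current

# Stub: each improvement round raises the compliance score by 20 points.
state = {"score": 30}
final = dev_loop(lambda: state["score"],
                 lambda: state.__setitem__("score", state["score"] + 20),
                 target=90)
print(final)  # 90, reached after three improvement rounds
```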

## waza serve

Start the interactive dashboard.

```sh
waza serve
```

| Flag | Description |
| --- | --- |
| `--port` | Port (default: 3000) |
| `--tcp` | TCP address for JSON-RPC (e.g., `:9000`) |
| `--stdio` | Use stdin/stdout for piping |

```sh
waza serve             # http://localhost:3000
waza serve --port 8080 # http://localhost:8080
waza serve --tcp :9000 # JSON-RPC TCP server
```

## Graders

Waza supports multiple grader types for comprehensive evaluation. See the complete Grader Reference for detailed documentation.

| Grader | Purpose |
| --- | --- |
| `code` | Python/JavaScript assertion-based validation |
| `regex` | Pattern matching in output |
| `file` | File existence and content validation |
| `diff` | Workspace file comparison with snapshots |
| `behavior` | Agent behavior constraints (tool calls, tokens, duration) |
| `action_sequence` | Tool call sequence validation with F1 scoring |
| `skill_invocation` | Skill orchestration sequence validation |
| `prompt` | LLM-as-judge evaluation with rubrics |
| `tool_constraint` | Validate tool usage constraints (e.g., required/forbidden tools, argument patterns) |
| `trigger_tests` | Prompt trigger accuracy detection |
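For intuition on the F1 scoring mentioned for `action_sequence`, here is the standard F1 formula applied to sets of tool calls. This is a generic sketch; the grader's actual matching rules (e.g., whether order or duplicates count) are not specified here and may differ:

```python
def f1_score(expected: list[str], actual: list[str]) -> float:
    """F1 = harmonic mean of precision and recall over tool-call sets."""
    expected_set, actual_set = set(expected), set(actual)
    overlap = expected_set & actual_set
    if not overlap:
        return 0.0
    precision = len(overlap) / len(actual_set)
    recall = len(overlap) / len(expected_set)
    return 2 * precision * recall / (precision + recall)

# Two of three expected calls made, plus one unexpected call.
print(f1_score(["read_file", "edit", "bash"],
               ["read_file", "bash", "create_file"]))  # ~0.667
```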

### tool_constraint

Validate agent tool usage constraints during evaluation.

```yaml
graders:
  - type: tool_constraint
    name: check_tools
    config:
      expect_tools:
        - tool: "bash"                 # Required tool call
          command_pattern: "azd\\s+up" # Optional regex on the command argument
        - tool: "skill"
          skill_pattern: "my-skill"    # Optional regex on the skill argument
        - tool: "edit"
          path_pattern: "\\.go$"       # Optional regex on the path argument
      reject_tools:
        - tool: "bash"                 # Prohibited when args match this pattern
          command_pattern: "rm\\s+-rf" # Optional regex on the command argument
        - tool: "create_file"          # Always prohibited
```

All config fields are optional. Omitted fields skip that constraint.
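The reject semantics can be read off the YAML above: a rule with a pattern prohibits the tool only when the argument matches; a rule without one prohibits the tool outright. A rough sketch, where `violates_reject` is a hypothetical helper and the call/rule shapes are assumptions mirroring the YAML field names:

```python
import re

def violates_reject(call: dict, rules: list[dict]) -> bool:
    """True if a tool call hits any reject rule."""
    for rule in rules:
        if call["tool"] != rule["tool"]:
            continue
        pattern = rule.get("command_pattern")
        if pattern is None:                          # no pattern: always prohibited
            return True
        if re.search(pattern, call.get("command", "")):
            return True
    return False

rules = [
    {"tool": "bash", "command_pattern": r"rm\s+-rf"},
    {"tool": "create_file"},                         # always prohibited
]
print(violates_reject({"tool": "bash", "command": "rm -rf /tmp/x"}, rules))  # True
print(violates_reject({"tool": "bash", "command": "ls"}, rules))             # False
print(violates_reject({"tool": "create_file", "command": ""}, rules))        # True
```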

### prompt (pairwise mode)

Use the prompt grader for LLM-as-judge evaluation. In pairwise mode, compare two approaches side by side to reduce position bias.

```yaml
graders:
  - type: prompt
    name: code_quality_judge
    config:
      mode: pairwise # Enable pairwise comparison (requires --baseline flag)
      rubric: "Compare these solutions for code quality and correctness"
      max_tokens: 500
```

Requirements:

- Pairwise mode requires the `--baseline` flag on `waza run`
- Baseline execution must complete before pairwise comparison runs
- Each task is evaluated twice: once without the skill (baseline) and once with it (treatment)

Example:

```sh
waza run eval.yaml --baseline -o results.json
# Output includes pairwise judge scores comparing baseline vs treatment approaches
```
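One plausible shape for a baseline-vs-treatment improvement breakdown (the dimensions are the ones `--baseline` reports; the field names, score scale, and `improvement` helper are illustrative, not waza's schema):

```python
def improvement(baseline: dict, treatment: dict) -> dict:
    """Positive deltas mean the skill helped: higher quality,
    fewer tokens, fewer turns, less time."""
    return {
        "quality": treatment["quality"] - baseline["quality"],
        "tokens": baseline["tokens"] - treatment["tokens"],
        "turns": baseline["turns"] - treatment["turns"],
        "time_s": baseline["time_s"] - treatment["time_s"],
    }

base = {"quality": 60, "tokens": 5200, "turns": 9, "time_s": 84}
treat = {"quality": 90, "tokens": 3900, "turns": 6, "time_s": 61}
print(improvement(base, treat))
# {'quality': 30, 'tokens': 1300, 'turns': 3, 'time_s': 23}
```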
## Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | Success |
| 1 | One or more tasks failed |
| 2 | Configuration or runtime error |
## Global Flags

| Flag | Description |
| --- | --- |
| `--help` | Show help |
| `--version` | Show version |
| `--verbose` | Enable debug output |
## Environment Variables

| Variable | Description |
| --- | --- |
| `GITHUB_TOKEN` | Token for Copilot SDK execution |
| `WAZA_HOME` | Config directory (default: `~/.waza`) |
| `WAZA_CACHE` | Cache directory (default: `.waza-cache`) |