Skip to content

Web Dashboard

The waza dashboard provides an interactive web interface for exploring evaluation results, comparing models, and tracking metrics.

Terminal window
waza serve

Opens http://localhost:3000 automatically.

Summary of all evaluation runs:

  • Recent Runs — List of recent evaluations with pass rate and model
  • Pass Rate Trend — Historical pass rate over time
  • Model Comparison — Side-by-side performance metrics
  • Top Failing Tasks — Tasks with lowest pass rate

Detailed results for a single evaluation run:

  • Task List — All tasks with pass/fail status
  • Validator Results — Per-validator scores and messages
  • Execution Stats — Duration, token usage, tool calls
  • Trajectory Viewer — Two-mode trace inspection (see below)
  • Export — Download run as JSON

The trajectory viewer provides two ways to inspect agent execution:

Timeline view — An Aspire-inspired waterfall visualization. Each tool call renders as a horizontal bar spanning its event range, color-coded by status (green = pass, red = fail, yellow = pending). A summary header shows total spans and per-tool call counts (e.g., “bash × 4, edit × 2”). Click any span to open a detail sidebar with arguments, result, duration, and event range.

Events view — A linear transcript showing every event in order (turns, tool calls, errors, partial results). Each entry is expandable for full detail.

Toggle between views with the Timeline / Events buttons at the top of the panel.

The waterfall timeline arranges tool call spans on a horizontal axis proportional to event count. Features:

  • Status indicators — Green ✓, red ✗, yellow ⏳
  • Call indexing — Repeated tools show call number badges (bash #1, bash #2)
  • Span correlation — Start and complete events matched by toolCallId
  • Interleaved support — Handles concurrent or nested tool calls
  • Detail sidebar — Click a span to see arguments, result, and raw event data; press Escape to close

Side-by-side comparison of two or more runs:

  • Model Metrics — Pass rate, average duration, tool call efficiency
  • Task Comparison — Which model passed each task
  • Validator Performance — Per-validator scores across models
  • Statistical Analysis — Confidence intervals, effect sizes

Historical metrics over time:

  • Pass Rate Trend — Model performance trending
  • Duration Trend — Execution speed over time
  • Task Coverage — Which tasks are consistently tested
  • Model Adoption — Usage patterns across models

Dashboard auto-refreshes when new results are saved:

Terminal window
# Terminal 1: Serve dashboard
waza serve
# Terminal 2: Run evaluations
waza run eval.yaml -o results.json
# Dashboard refreshes automatically

Filter results by:

  • Status — Passed / Failed
  • Tags — Task tags (e.g., “basic”, “edge-case”)
  • Date Range — Last 7 days, month, all-time
  • Model — Specific model only

Full-text search across:

  • Task names and descriptions
  • Validator messages
  • Error messages
  • Transcripts

Export data in multiple formats:

  • JSON — Complete result structure
  • CSV — Task results for spreadsheet analysis
  • PDF — Formatted report for sharing

The dashboard shows a per-run Cost value and an Avg Cost KPI. The number comes from one of three sources, in priority order:

SourceTooltip labelWhen it applies
sdk”Reported by Copilot SDK”The Copilot SDK reported a per-request cost during execution (most accurate).
table”Calculated from model rate table (as of 2025-01-01)“Cost was computed from input/output/cache token counts × per-model rates for known Claude 3.5/4.x, GPT‑5 family, GPT‑4o/4.1, and Gemini 2.5 model IDs.
estimate”Rough flat-rate estimate ($0.00025/token) — model pricing unavailable”Fallback when the model isn’t in the rate table.
mixed”Mixed sources across runs — hover individual rows for details”The aggregate covers runs that were priced differently.

Hover the info icon next to any Cost label in the dashboard to see the source for that value. Per-row tooltips in the Runs table reveal the source for each individual run.

The rate table is a best-effort snapshot of published list pricing as of the effective date shown above and will drift over time. For authoritative invoicing, consult your provider billing.

Change the default port (3000):

Terminal window
waza serve --port 8080

Run as a JSON-RPC TCP server instead of HTTP:

Terminal window
waza serve --tcp :9000

Connect from other applications using JSON-RPC 2.0 protocol.

Use stdio for piping:

Terminal window
waza serve --stdio

Dashboard loads JSON results with this structure:

{
"name": "code-explainer-eval",
"model": "claude-sonnet-4.6",
"timestamp": "2025-02-20T10:30:00Z",
"pass_rate": 0.8,
"duration_ms": 30000,
"tasks": [
{
"id": "basic-001",
"name": "Basic Usage",
"passed": true,
"duration_ms": 5000,
"graders": [
{
"name": "checks_logic",
"passed": true,
"score": 1.0,
"message": "All patterns matched"
}
]
}
]
}

Local iteration with dashboard:

Terminal window
# Terminal 1: Start dashboard
cd my-eval-suite
waza serve
# Terminal 2: Run evaluations
waza run code-explainer -o results.json
# Terminal 3 (optional): Monitor results
# Dashboard auto-refreshes, or manually refresh in browser

Comparison workflow:

Terminal window
# Run with multiple models
waza run eval.yaml --model gpt-4o -o gpt4.json
waza run eval.yaml --model claude-sonnet-4.6 -o sonnet.json
# View in dashboard
# Select both results for side-by-side comparison

Dashboard works with GitHub Actions:

  1. Evaluation runs in CI generate results.json
  2. Results uploaded as workflow artifact
  3. Download artifact and open in dashboard:
Terminal window
# Download from GitHub
gh run download <run-id> -n results
# View in dashboard
waza serve

Dashboard not running. Start with waza serve.

Use a different port:

Terminal window
waza serve --port 8080

Ensure JSON is valid:

Terminal window
jq . results.json