Web Dashboard

The waza dashboard provides an interactive web interface for exploring evaluation results, comparing models, and tracking metrics.

Starting the Dashboard

waza serve

This opens http://localhost:3000 automatically.

Views

Dashboard Overview

Summary of all evaluation runs:

Recent Runs — List of recent evaluations with pass rate and model
Pass Rate Trend — Historical pass rate over time
Model Comparison — Side-by-side performance metrics
Top Failing Tasks — Tasks with lowest pass rate

Run Details

Detailed results for a single evaluation run:

Task List — All tasks with pass/fail status
Validator Results — Per-validator scores and messages
Execution Stats — Duration, token usage, tool calls
Trajectory Viewer — Two-mode trace inspection (see below)
Export — Download run as JSON

Trajectory Viewer

The trajectory viewer provides two ways to inspect agent execution:

Timeline view — An Aspire-inspired waterfall visualization. Each tool call renders as a horizontal bar spanning its event range, color-coded by status (green = pass, red = fail, yellow = pending). A summary header shows total spans and per-tool call counts (e.g., “bash × 4, edit × 2”). Click any span to open a detail sidebar with arguments, result, duration, and event range.

Events view — A linear transcript showing every event in order (turns, tool calls, errors, partial results). Each entry is expandable for full detail.

Toggle between views with the Timeline / Events buttons at the top of the panel.

Waterfall Timeline

The waterfall timeline arranges tool call spans on a horizontal axis proportional to event count. Features:

Status indicators — Green ✓, red ✗, yellow ⏳
Call indexing — Repeated tools show call number badges (bash #1, bash #2)
Span correlation — Start and complete events matched by toolCallId
Interleaved support — Handles concurrent or nested tool calls
Detail sidebar — Click a span to see arguments, result, and raw event data; press Escape to close
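
The span correlation and call indexing listed above can be sketched in a few lines. This is an illustration only, not the dashboard's implementation: of the names used here, only toolCallId appears on this page, while the event type values (tool_start, tool_complete) and the toolName and success fields are assumptions made for the example.

```python
from collections import Counter

def build_spans(events):
    """Pair start/complete events by toolCallId into waterfall spans.

    The event shape is hypothetical: each event is a dict with "type",
    "toolCallId", "toolName" (on start) and "success" (on completion).
    """
    spans = {}                  # toolCallId -> span, kept in start order
    calls_per_tool = Counter()  # running count behind "bash #1, bash #2"
    for index, event in enumerate(events):
        call_id = event.get("toolCallId")
        if event["type"] == "tool_start":
            calls_per_tool[event["toolName"]] += 1
            spans[call_id] = {
                "tool": event["toolName"],
                "call_number": calls_per_tool[event["toolName"]],
                "start": index,      # event range start
                "end": None,         # filled in when the call completes
                "status": "pending",
            }
        elif event["type"] == "tool_complete" and call_id in spans:
            span = spans[call_id]
            span["end"] = index
            span["status"] = "pass" if event.get("success") else "fail"
    return list(spans.values())

def summarize(spans):
    """Build a per-tool summary header such as 'bash × 4, edit × 2'."""
    counts = Counter(span["tool"] for span in spans)
    return ", ".join(f"{tool} × {n}" for tool, n in counts.items())

# Interleaved calls: "edit" starts and finishes while the first "bash" is open.
events = [
    {"type": "tool_start", "toolCallId": "a", "toolName": "bash"},
    {"type": "tool_start", "toolCallId": "b", "toolName": "edit"},
    {"type": "tool_complete", "toolCallId": "b", "success": True},
    {"type": "tool_complete", "toolCallId": "a", "success": False},
    {"type": "tool_start", "toolCallId": "c", "toolName": "bash"},
]
spans = build_spans(events)
print(summarize(spans))
```

Keying spans by toolCallId rather than by position is what lets concurrent or nested calls resolve correctly: a completion event finds its own start regardless of what ran in between.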

Compare View

Side-by-side comparison of two or more runs:

Model Metrics — Pass rate, average duration, tool call efficiency
Task Comparison — Which model passed each task
Validator Performance — Per-validator scores across models
Statistical Analysis — Confidence intervals, effect sizes
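
This page does not say which method the dashboard uses for its confidence intervals. As a rough guide to reading them, the sketch below computes a Wilson score interval for a pass rate, which is one common choice for pass/fail proportions on small task sets. The function name and the 95% z value are assumptions made for the example.

```python
import math

def wilson_interval(passed, total, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# 8 of 10 tasks passed: the interval is wide because the sample is small.
low, high = wilson_interval(8, 10)
print(f"pass rate 0.80, interval [{low:.2f}, {high:.2f}]")
```

With only ten tasks the interval spans a large part of the range, so a difference of a few points between two models is weak evidence on its own.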

Trends

Historical metrics over time:

Pass Rate Trend — Model performance trending
Duration Trend — Execution speed over time
Task Coverage — Which tasks are consistently tested
Model Adoption — Usage patterns across models

Features

Live Updates

The Live view uses Server-Sent Events (SSE) to stream run progress from the dashboard API. The per-run endpoint is:

GET /api/v1/runs/{runId}/events
Accept: text/event-stream

Each message has an SSE id matching the JSON sequence field. Clients can reconnect with the Last-Event-ID header, or with the equivalent lastEventId query parameter, to resume after the last event they processed:

GET /api/v1/runs/{runId}/events?lastEventId=42
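
A resuming request can be assembled with the Python standard library as sketched below. The endpoint path and the Last-Event-ID header come from this page. The helper name, the run ID run-123, and the base URL (the default port from this page) are placeholders for illustration.

```python
from urllib.parse import quote
from urllib.request import Request

def resume_request(base_url, run_id, last_event_id=None):
    """Build (but do not send) a request for a run's event stream."""
    url = f"{base_url}/api/v1/runs/{quote(run_id, safe='')}/events"
    headers = {"Accept": "text/event-stream"}
    if last_event_id is not None:
        # Ask the server to resume after the last event already processed.
        headers["Last-Event-ID"] = str(last_event_id)
    return Request(url, headers=headers)

# "run-123" is a hypothetical run ID.
req = resume_request("http://localhost:3000", "run-123", last_event_id=42)
print(req.full_url)
# urllib stores header names capitalized; HTTP header names are case-insensitive.
print(req.get_header("Last-event-id"))
```

Passing the request to urllib.request.urlopen would open the stream. The header form is usually preferable to the query parameter when the client library allows custom headers.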

The stream replays existing events, remains open while the run is still progressing, and closes after run_completed or run_failed. Event payloads include schemaVersion, runId, type, timestamp, and a data object. Current event types are run_started, task_started, step_executed, task_completed, run_completed, and run_failed.
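
The behavior described above can be sketched as a small consumer that parses SSE framing, remembers the last event ID for reconnects, and stops at a terminal event. It is a minimal sketch: it assumes the JSON payload travels in the SSE data field, and the sample messages (including the schemaVersion value and the run ID) are made up for the example.

```python
import json

TERMINAL_TYPES = {"run_completed", "run_failed"}

def parse_sse(lines):
    """Yield (event_id, payload) for each SSE message in an iterable of lines."""
    event_id, data_lines = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                      # blank line ends one message
            if data_lines:
                yield event_id, json.loads("\n".join(data_lines))
            event_id, data_lines = None, []
        elif line.startswith(":"):          # comment / keep-alive line
            continue
        elif line.startswith("id:"):
            event_id = line[3:].strip()
        elif line.startswith("data:"):
            value = line[5:]
            data_lines.append(value[1:] if value.startswith(" ") else value)

def follow(lines):
    """Consume events until the run ends; return (last_event_id, event types)."""
    last_id, seen = None, []
    for event_id, payload in parse_sse(lines):
        last_id = event_id or last_id       # value to send back on reconnect
        seen.append(payload["type"])
        if payload["type"] in TERMINAL_TYPES:
            break
    return last_id, seen

def message(sequence, event_type):
    """Build one sample SSE message (illustrative values only)."""
    payload = {"schemaVersion": 1, "runId": "run-123", "sequence": sequence,
               "type": event_type, "timestamp": "2025-02-20T10:30:00Z", "data": {}}
    return [f"id: {sequence}\n", f"data: {json.dumps(payload)}\n", "\n"]

sample = (message(1, "run_started") + message(2, "task_started")
          + message(3, "run_completed") + message(4, "task_started"))
last_id, seen = follow(sample)
print(last_id, seen)
```

In real use, the lines would come from decoding the HTTP response body line by line, and last_id would be sent back as Last-Event-ID if the connection drops before a terminal event arrives.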

The legacy /api/events endpoint remains available and replays events for the newest run.

To view progress while iterating locally:

# Terminal 1: Serve dashboard
waza serve

# Terminal 2: Run evaluations
waza run eval.yaml -o results.json
# Dashboard refreshes automatically

Filtering

Filter results by:

Status — Passed / Failed
Tags — Task tags (e.g., “basic”, “edge-case”)
Date Range — Last 7 days, month, all-time
Model — Specific model only

Search

Full-text search across:

Task names and descriptions
Validator messages
Error messages
Transcripts

Export

Export data in multiple formats:

JSON — Complete result structure
CSV — Task results for spreadsheet analysis
PDF — Formatted report for sharing

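Cost Accuracy

The dashboard shows a per-run Cost value and an Avg Cost KPI. Each per-run value comes from one of three sources, listed in priority order (sdk, table, estimate); an aggregate whose runs were priced from different sources is labeled mixed: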

Source	Tooltip label	When it applies
`sdk`	“Reported by Copilot SDK”	The Copilot SDK reported a per-request cost during execution (most accurate).
`table`	“Calculated from model rate table (as of 2025-01-01)”	Cost was computed from input/output/cache token counts × per-model rates for known Claude 3.5/4.x, GPT‑5 family, GPT‑4o/4.1, and Gemini 2.5 model IDs.
`estimate`	“Rough flat-rate estimate ($0.00025/token) — model pricing unavailable”	Fallback when the model isn’t in the rate table.
`mixed`	“Mixed sources across runs — hover individual rows for details”	The aggregate covers runs that were priced differently.

Hover the info icon next to any Cost label in the dashboard to see the source for that value. Per-row tooltips in the Runs table reveal the source for each individual run.

The rate table is a best-effort snapshot of published list pricing as of the effective date shown above and will drift over time. For authoritative invoicing, consult your provider billing.
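
The source selection described in this section can be sketched as follows. This is an illustration under stated assumptions, not the dashboard's code: the run field names and the rate table entry are hypothetical (the rates are placeholders, not real pricing), cache tokens are left out for brevity, and only the flat fallback rate comes from this page.

```python
FLAT_RATE_PER_TOKEN = 0.00025  # flat fallback rate quoted in the estimate tooltip

# Hypothetical rates in USD per million tokens; NOT real pricing.
RATE_TABLE = {"example-model": {"input": 1.00, "output": 2.00}}

def price_run(run):
    """Return (cost, source) using the priority sdk > table > estimate."""
    if run.get("sdk_cost") is not None:
        return run["sdk_cost"], "sdk"
    rates = RATE_TABLE.get(run["model"])
    if rates is not None:
        cost = (run["input_tokens"] * rates["input"]
                + run["output_tokens"] * rates["output"]) / 1_000_000
        return cost, "table"
    total_tokens = run["input_tokens"] + run["output_tokens"]
    return total_tokens * FLAT_RATE_PER_TOKEN, "estimate"

def aggregate_source(sources):
    """Label an aggregate: a single shared source, otherwise 'mixed'."""
    unique = set(sources)
    if not unique:
        raise ValueError("no runs to aggregate")
    return unique.pop() if len(unique) == 1 else "mixed"

runs = [
    {"model": "example-model", "sdk_cost": 0.5, "input_tokens": 10, "output_tokens": 10},
    {"model": "unknown-model", "input_tokens": 600, "output_tokens": 400},
]
priced = [price_run(run) for run in runs]
print(priced, aggregate_source(source for _, source in priced))
```

Returning the source alongside the cost is what makes per-row tooltips possible: the aggregate label is derived from the per-run sources rather than stored separately.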

Configuration

Dashboard Port

Change the default port (3000):

waza serve --port 8080

JSON-RPC Server

Run as a JSON-RPC TCP server instead of HTTP:

waza serve --tcp :9000

Connect from other applications using the JSON-RPC 2.0 protocol.
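
This page does not list the server's method names or describe how messages are framed on the TCP connection, so the sketch below covers only what the JSON-RPC 2.0 specification fixes: the request and response envelopes. The method name example.method and its parameters are placeholders, not real waza methods.

```python
import json
from itertools import count

_ids = count(1)  # simple source of unique request IDs

def make_request(method, params=None):
    """Build a JSON-RPC 2.0 request object."""
    request = {"jsonrpc": "2.0", "method": method, "id": next(_ids)}
    if params is not None:
        request["params"] = params
    return request

def read_response(raw):
    """Return the result of a JSON-RPC 2.0 response, raising on an error object."""
    response = json.loads(raw)
    if response.get("jsonrpc") != "2.0":
        raise ValueError("not a JSON-RPC 2.0 response")
    if "error" in response:
        err = response["error"]
        raise RuntimeError(f"JSON-RPC error {err['code']}: {err['message']}")
    return response["result"]

# "example.method" is a placeholder, not a documented waza method.
request = make_request("example.method", {"path": "eval.yaml"})
print(json.dumps(request))
print(read_response('{"jsonrpc": "2.0", "id": 1, "result": {"ok": true}}'))
```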

Stdin/Stdout

Use stdio for piping:

waza serve --stdio

Result Format

The dashboard loads JSON results with this structure:

{
  "name": "code-explainer-eval",
  "model": "claude-sonnet-4.6",
  "timestamp": "2025-02-20T10:30:00Z",
  "pass_rate": 0.8,
  "duration_ms": 30000,
  "tasks": [
    {
      "id": "basic-001",
      "name": "Basic Usage",
      "passed": true,
      "duration_ms": 5000,
      "graders": [
        {
          "name": "checks_logic",
          "passed": true,
          "score": 1.0,
          "message": "All patterns matched"
        }
      ]
    }
  ]
}
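
A result file with this structure can be inspected from a script before it is opened in the dashboard. The sketch below is a minimal example: treating every top-level key shown above as required is an assumption, and the second task in the sample data is made up to show a failing grader.

```python
import json

# Assumption: every top-level key from the example above is required.
REQUIRED_RUN_KEYS = {"name", "model", "timestamp", "pass_rate", "duration_ms", "tasks"}

def missing_keys(result):
    """Return the expected top-level keys that are absent from a result."""
    return sorted(REQUIRED_RUN_KEYS - result.keys())

def failing_graders(result):
    """List (task id, grader name, message) for every grader that did not pass."""
    return [
        (task["id"], grader["name"], grader["message"])
        for task in result["tasks"]
        for grader in task.get("graders", [])
        if not grader["passed"]
    ]

# Sample data; "edge-002" and its grader message are illustrative only.
result = json.loads("""
{
  "name": "code-explainer-eval", "model": "claude-sonnet-4.6",
  "timestamp": "2025-02-20T10:30:00Z", "pass_rate": 0.5, "duration_ms": 30000,
  "tasks": [
    {"id": "basic-001", "name": "Basic Usage", "passed": true, "duration_ms": 5000,
     "graders": [{"name": "checks_logic", "passed": true, "score": 1.0,
                  "message": "All patterns matched"}]},
    {"id": "edge-002", "name": "Edge Case", "passed": false, "duration_ms": 7000,
     "graders": [{"name": "checks_logic", "passed": false, "score": 0.0,
                  "message": "Pattern not found"}]}
  ]
}
""")
print(missing_keys(result))
print(failing_graders(result))
```

To run this against a real file, replace the inline sample with json.load on an open results.json.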

Workflow

Local iteration with dashboard:

# Terminal 1: Start dashboard
cd my-eval-suite
waza serve

# Terminal 2: Run evaluations
waza run code-explainer -o results.json

# Browser (optional): monitor results
# The dashboard auto-refreshes; reload the page manually if needed

Comparison workflow:

# Run with multiple models
waza run eval.yaml --model gpt-4o -o gpt4.json
waza run eval.yaml --model claude-sonnet-4.6 -o sonnet.json

# View in dashboard
# Select both results for side-by-side comparison
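
The same per-task comparison can be approximated outside the dashboard. The sketch below assumes result files in the format shown under Result Format. The helper names and the inline sample data are illustrative; in practice the two dictionaries would be loaded from gpt4.json and sonnet.json.

```python
def compare_runs(results):
    """Map each task id to {model: passed} across several result dicts."""
    table = {}
    for result in results:
        for task in result["tasks"]:
            table.setdefault(task["id"], {})[result["model"]] = task["passed"]
    return table

def disagreements(table):
    """Return task ids on which the models did not all get the same outcome."""
    return sorted(
        task_id for task_id, by_model in table.items()
        if len(set(by_model.values())) > 1
    )

# Illustrative stand-ins for the contents of gpt4.json and sonnet.json.
gpt4 = {"model": "gpt-4o", "tasks": [
    {"id": "basic-001", "passed": True},
    {"id": "edge-002", "passed": False},
]}
sonnet = {"model": "claude-sonnet-4.6", "tasks": [
    {"id": "basic-001", "passed": True},
    {"id": "edge-002", "passed": True},
]}

table = compare_runs([gpt4, sonnet])
print(disagreements(table))
```

Tasks on which the models disagree are usually the most informative ones to open in Run Details.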

Integration with CI/CD

The dashboard works with GitHub Actions:

Evaluation runs in CI generate results.json.
Results are uploaded as a workflow artifact.
Download the artifact and open it in the dashboard:

# Download from GitHub
gh run download <run-id> -n results

# View in dashboard
waza serve

Troubleshooting

“Connection refused”

The dashboard is not running. Start it with waza serve.

Port already in use

Use a different port:

waza serve --port 8080

Results not loading

Ensure the JSON is valid:

jq . results.json

Next Steps

CLI Reference — All commands
YAML Schema — eval.yaml format
GitHub Repository — Source and examples