Web Dashboard
The waza dashboard provides an interactive web interface for exploring evaluation results, comparing models, and tracking metrics.
Starting the Dashboard
    waza serve

Opens http://localhost:3000 automatically.
Dashboard Overview
Summary of all evaluation runs:
- Recent Runs — List of recent evaluations with pass rate and model
- Pass Rate Trend — Historical pass rate over time
- Model Comparison — Side-by-side performance metrics
- Top Failing Tasks — Tasks with lowest pass rate
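As a rough illustration of how a "Top Failing Tasks" ranking could be derived, the sketch below aggregates pass rates per task across several runs. It assumes each run follows the result structure shown under Result Format (`tasks[].id`, `tasks[].passed`); the function name, the `edge-002` task ID, and the tie-breaking rule are hypothetical, and the dashboard's actual ranking logic may differ.

```python
from collections import defaultdict


def top_failing_tasks(runs, limit=5):
    """Rank tasks by ascending pass rate across a list of run results."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for run in runs:
        for task in run.get("tasks", []):
            total[task["id"]] += 1
            if task.get("passed"):
                passed[task["id"]] += 1
    rates = [(task_id, passed[task_id] / total[task_id]) for task_id in total]
    # Lowest pass rate first; ties are broken by task ID for stable output.
    rates.sort(key=lambda item: (item[1], item[0]))
    return rates[:limit]


# Two illustrative runs; "edge-002" is a made-up task ID.
runs = [
    {"model": "gpt-4o", "tasks": [
        {"id": "basic-001", "passed": True},
        {"id": "edge-002", "passed": False},
    ]},
    {"model": "claude-sonnet-4.6", "tasks": [
        {"id": "basic-001", "passed": True},
        {"id": "edge-002", "passed": True},
    ]},
]
print(top_failing_tasks(runs))  # [('edge-002', 0.5), ('basic-001', 1.0)]
```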
Run Details
Detailed results for a single evaluation run:
- Task List — All tasks with pass/fail status
- Validator Results — Per-validator scores and messages
- Execution Stats — Duration, token usage, tool calls
- Trajectory Viewer — Two-mode trace inspection (see below)
- Export — Download run as JSON
Trajectory Viewer
The trajectory viewer provides two ways to inspect agent execution:
Timeline view — An Aspire-inspired waterfall visualization. Each tool call renders as a horizontal bar spanning its event range, color-coded by status (green = pass, red = fail, yellow = pending). A summary header shows total spans and per-tool call counts (e.g., “bash × 4, edit × 2”). Click any span to open a detail sidebar with arguments, result, duration, and event range.
Events view — A linear transcript showing every event in order (turns, tool calls, errors, partial results). Each entry is expandable for full detail.
Toggle between views with the Timeline / Events buttons at the top of the panel.
Waterfall Timeline
The waterfall timeline arranges tool call spans on a horizontal axis proportional to event count. Features:
- Status indicators — Green ✓, red ✗, yellow ⏳
- Call indexing — Repeated tools show call number badges (bash #1, bash #2)
- Span correlation — Start and complete events matched by toolCallId
- Interleaved support — Handles concurrent or nested tool calls
- Detail sidebar — Click a span to see arguments, result, and raw event data; press Escape to close
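The span correlation and call indexing described above can be sketched as follows. This is a minimal illustration, not waza's implementation: only `toolCallId` comes from this page, while the event `type` values (`tool_start`, `tool_complete`), the `tool` and `ok` fields, and the function name are assumptions made for the example.

```python
from collections import defaultdict


def correlate_spans(events):
    """Pair start/complete events by toolCallId and number repeated tools."""
    open_spans = {}                    # toolCallId -> span awaiting completion
    calls_per_tool = defaultdict(int)  # tool name -> calls seen so far
    spans = []
    for index, event in enumerate(events):
        call_id = event.get("toolCallId")
        if event["type"] == "tool_start":
            calls_per_tool[event["tool"]] += 1
            span = {
                "toolCallId": call_id,
                "label": f'{event["tool"]} #{calls_per_tool[event["tool"]]}',
                "start": index,   # event range start
                "end": None,      # filled in when the complete event arrives
                "status": "pending",
            }
            open_spans[call_id] = span
            spans.append(span)
        elif event["type"] == "tool_complete" and call_id in open_spans:
            # Lookup by ID (not by order) is what makes interleaving work.
            span = open_spans.pop(call_id)
            span["end"] = index
            span["status"] = "pass" if event.get("ok") else "fail"
    return spans


# Interleaved calls: "a" starts first but completes after "b"; "c" never completes.
events = [
    {"type": "tool_start", "toolCallId": "a", "tool": "bash"},
    {"type": "tool_start", "toolCallId": "b", "tool": "bash"},
    {"type": "tool_complete", "toolCallId": "b", "ok": True},
    {"type": "tool_complete", "toolCallId": "a", "ok": False},
    {"type": "tool_start", "toolCallId": "c", "tool": "edit"},
]
spans = correlate_spans(events)
for span in spans:
    print(span["label"], span["start"], span["end"], span["status"])
```

Keying open spans by ID rather than using a stack means a span can close in any order relative to its neighbors, which covers both nested and concurrent calls.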
Compare View
Side-by-side comparison of two or more runs:
- Model Metrics — Pass rate, average duration, tool call efficiency
- Task Comparison — Which model passed each task
- Validator Performance — Per-validator scores across models
- Statistical Analysis — Confidence intervals, effect sizes
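This page does not say which statistical methods the Compare View uses. As one common way to put a confidence interval on a pass rate, the sketch below computes a Wilson score interval; the function name and the choice of Wilson over other intervals are assumptions for illustration only.

```python
import math


def wilson_interval(passed, total, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    if total == 0:
        return (0.0, 1.0)  # no data: the rate is entirely unconstrained
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# 8 of 10 tasks passed, matching the pass_rate of 0.8 in the Result Format example.
low, high = wilson_interval(8, 10)
print(f"pass rate 0.80, interval {low:.3f} to {high:.3f}")
```

With only ten tasks the interval is wide, which is a useful reminder that small differences between two models on a short eval suite may not be meaningful.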
Trends
Historical metrics over time:
- Pass Rate Trend — Model performance trending
- Duration Trend — Execution speed over time
- Task Coverage — Which tasks are consistently tested
- Model Adoption — Usage patterns across models
Features
Live Updates
The dashboard auto-refreshes when new results are saved:

    # Terminal 1: Serve dashboard
    waza serve

    # Terminal 2: Run evaluations
    waza run eval.yaml -o results.json
    # Dashboard refreshes automatically

Filtering
Filter results by:
- Status — Passed / Failed
- Tags — Task tags (e.g., “basic”, “edge-case”)
- Date Range — Last 7 days, month, all-time
- Model — Specific model only
Search
Full-text search across:
- Task names and descriptions
- Validator messages
- Error messages
- Transcripts
Export
Export data in multiple formats:
- JSON — Complete result structure
- CSV — Task results for spreadsheet analysis
- PDF — Formatted report for sharing
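If you need a CSV outside the dashboard, a JSON result can be flattened with a few lines of Python. The sketch below assumes the structure shown under Result Format; the column names and their order are hypothetical and are not guaranteed to match the dashboard's own CSV export.

```python
import csv
import io


def tasks_to_csv(result):
    """Flatten one run result into CSV text, one row per task."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["run", "model", "task_id", "task_name", "passed", "duration_ms"])
    for task in result.get("tasks", []):
        writer.writerow([
            result.get("name"),
            result.get("model"),
            task["id"],
            task.get("name"),
            task.get("passed"),
            task.get("duration_ms"),
        ])
    return buffer.getvalue()


# Trimmed copy of the example from the Result Format section.
result = {
    "name": "code-explainer-eval",
    "model": "claude-sonnet-4.6",
    "tasks": [
        {"id": "basic-001", "name": "Basic Usage", "passed": True, "duration_ms": 5000},
    ],
}
csv_text = tasks_to_csv(result)
print(csv_text)
```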
Cost accuracy
The dashboard shows a per-run Cost value and an Avg Cost KPI. Each run's cost comes from one of three sources, listed in priority order (sdk, table, estimate). A fourth label, mixed, appears only on aggregates:
| Source | Tooltip label | When it applies |
|---|---|---|
| sdk | “Reported by Copilot SDK” | The Copilot SDK reported a per-request cost during execution (most accurate). |
| table | “Calculated from model rate table (as of 2025-01-01)” | Cost was computed from input/output/cache token counts × per-model rates for known Claude 3.5/4.x, GPT‑5 family, GPT‑4o/4.1, and Gemini 2.5 model IDs. |
| estimate | “Rough flat-rate estimate ($0.00025/token) — model pricing unavailable” | Fallback when the model isn’t in the rate table. |
| mixed | “Mixed sources across runs — hover individual rows for details” | The aggregate covers runs that were priced differently. |
Hover the info icon next to any Cost label in the dashboard to see the source for that value. Per-row tooltips in the Runs table reveal the source for each individual run.
The rate table is a best-effort snapshot of published list pricing as of the effective date shown above and will drift over time. For authoritative invoicing, consult your provider's billing records.
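The priority order can be sketched as a simple fallback chain. Only the source labels and the $0.00025/token flat rate come from this page; the field names (`sdk_cost`, `input_tokens`, `output_tokens`), the `example-model` entry, and its rates are hypothetical placeholders, and cache tokens are left out for brevity.

```python
FLAT_RATE_PER_TOKEN = 0.00025  # flat-rate fallback quoted in the tooltip

# Hypothetical rates in USD per million tokens; NOT real pricing.
RATE_TABLE = {"example-model": {"input": 3.0, "output": 15.0}}


def resolve_cost(run):
    """Return (cost, source) using the sdk > table > estimate priority."""
    if run.get("sdk_cost") is not None:
        return run["sdk_cost"], "sdk"
    rates = RATE_TABLE.get(run.get("model"))
    if rates is not None:
        cost = (run["input_tokens"] * rates["input"]
                + run["output_tokens"] * rates["output"]) / 1_000_000
        return cost, "table"
    tokens = run["input_tokens"] + run["output_tokens"]
    return tokens * FLAT_RATE_PER_TOKEN, "estimate"


def aggregate_source(runs):
    """Label an aggregate: a single shared source, or 'mixed' otherwise."""
    sources = {resolve_cost(run)[1] for run in runs}
    return sources.pop() if len(sources) == 1 else "mixed"


runs = [
    {"model": "example-model", "sdk_cost": 0.12, "input_tokens": 1000, "output_tokens": 500},
    {"model": "example-model", "input_tokens": 1000, "output_tokens": 500},
    {"model": "unknown-model", "input_tokens": 1000, "output_tokens": 500},
]
for run in runs:
    print(resolve_cost(run))
print(aggregate_source(runs))  # mixed
```

Note how far the flat-rate estimate can sit from a table-based figure for the same token counts, which is why the tooltip calls it rough.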
Configuration
Dashboard Port
Change the default port (3000):

    waza serve --port 8080

JSON-RPC Server
Run as a JSON-RPC TCP server instead of HTTP:

    waza serve --tcp :9000

Connect from other applications using the JSON-RPC 2.0 protocol.
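This page does not list the methods the server exposes or how messages are framed on the TCP stream, so the sketch below covers only the part fixed by the JSON-RPC 2.0 specification: the request and response envelopes. The method name `runs.list` and its parameters are hypothetical placeholders; check the CLI reference for the real ones.

```python
import itertools
import json

_ids = itertools.count(1)  # each request needs an ID to match its response


def build_request(method, params=None):
    """Build a JSON-RPC 2.0 request envelope."""
    request = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        request["params"] = params
    return request


def parse_response(raw, expected_id):
    """Validate a JSON-RPC 2.0 response and return its result."""
    message = json.loads(raw)
    if message.get("jsonrpc") != "2.0" or message.get("id") != expected_id:
        raise ValueError("not a matching JSON-RPC 2.0 response")
    if "error" in message:
        error = message["error"]
        raise RuntimeError(f'{error["code"]}: {error["message"]}')
    return message["result"]


# "runs.list" is a made-up method name used only to show the envelope shape.
request = build_request("runs.list", {"limit": 5})
print(json.dumps(request))

# A simulated reply, standing in for bytes read from the socket.
reply = json.dumps({"jsonrpc": "2.0", "id": request["id"], "result": []})
print(parse_response(reply, request["id"]))
```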
Stdin/Stdout
Use stdio for piping:

    waza serve --stdio

Result Format
The dashboard loads JSON results with this structure:

    {
      "name": "code-explainer-eval",
      "model": "claude-sonnet-4.6",
      "timestamp": "2025-02-20T10:30:00Z",
      "pass_rate": 0.8,
      "duration_ms": 30000,
      "tasks": [
        {
          "id": "basic-001",
          "name": "Basic Usage",
          "passed": true,
          "duration_ms": 5000,
          "graders": [
            {
              "name": "checks_logic",
              "passed": true,
              "score": 1.0,
              "message": "All patterns matched"
            }
          ]
        }
      ]
    }

Workflow
Local iteration with dashboard:

    # Terminal 1: Start dashboard
    cd my-eval-suite
    waza serve

    # Terminal 2: Run evaluations
    waza run code-explainer -o results.json

    # Terminal 3 (optional): Monitor results
    # Dashboard auto-refreshes, or manually refresh in browser

Comparison workflow:

    # Run with multiple models
    waza run eval.yaml --model gpt-4o -o gpt4.json
    waza run eval.yaml --model claude-sonnet-4.6 -o sonnet.json

    # View in dashboard
    # Select both results for side-by-side comparison

Integration with CI/CD
The dashboard works with GitHub Actions:
- Evaluation runs in CI generate results.json
- Results are uploaded as a workflow artifact
- Download the artifact and open it in the dashboard:

    # Download from GitHub
    gh run download <run-id> -n results

    # View in dashboard
    waza serve

Troubleshooting
“Connection refused”
The dashboard is not running. Start it with waza serve.
Port already in use
Use a different port:

    waza serve --port 8080

Results not loading
Ensure the JSON is valid:

    jq . results.json

Next Steps
- CLI Reference — All commands
- YAML Schema — eval.yaml format
- GitHub Repository — Source and examples