
Explore the Dashboard

The waza dashboard gives you a complete picture of your AI agent skill evaluations. Launch it with a single command and explore every view described below.

waza serve --results-dir ./results

The landing page shows all evaluation runs at a glance. Six KPI cards summarize your total runs, tasks, pass rate, token usage, cost, and duration. Below the cards, a sortable table lists every run with its spec, model, pass rate, weighted score, task count, tokens, cost, duration, and relative timestamp.

Eval Runs overview showing KPI cards and sortable runs table
  • KPI cards — Total Runs, Total Tasks, Pass Rate, Avg Tokens, Avg Cost, and Avg Duration update in real time as results load
  • Sortable columns — click any column header (Spec, Model, Tasks, Tokens, Cost, Duration, When) to sort ascending or descending
  • Judge model indicator — a ⚖ icon next to the model name indicates a separate judge model was used for grading
  • Export CSV — download the full runs table as a CSV file for offline analysis
  • Click any row to drill into that run’s detail view
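The exported CSV can be analyzed with standard tooling. A minimal sketch using Python's `csv` module (the column names and values below are illustrative assumptions, not the real export's header — inspect your file first):

```python
import csv
import io

# Hypothetical excerpt of an exported runs CSV; the real export's
# column names may differ -- check the file header before parsing.
runs_csv = """spec,model,pass_rate,tasks,cost
file-ops.yaml,gpt-4o,0.90,10,0.42
file-ops.yaml,gpt-4o-mini,0.70,10,0.08
"""

rows = list(csv.DictReader(io.StringIO(runs_csv)))

# Sort descending by pass rate, mirroring the dashboard's sortable column.
rows.sort(key=lambda r: float(r["pass_rate"]), reverse=True)
best = rows[0]
print(best["model"], best["pass_rate"])  # gpt-4o 0.90
```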

Clicking a run opens its detail page. The Tasks tab shows every task in the evaluation with its outcome, raw score, weighted score, and duration. Expand any row to see individual grader results.

Run detail showing the Tasks tab with outcome badges and scores
  • Outcome badges — green pass and red fail badges make it easy to spot problems
  • Score vs. W. Score — the raw grader score and the weighted aggregate score are shown side by side
  • Statistical confidence — tasks with weighted scores display significance badges (✓ significant / ⚠ not significant) and confidence interval ranges
  • Export CSV — export task-level results for the current run

Switch to the Trajectory tab to see the agent’s execution path. The task selector lists every task with its pass/fail badge.

Trajectory tab showing the task selector with pass/fail badges
  • Task selector — every task listed as a clickable button with its outcome badge
  • Color-coded badges — green for pass, red for fail
  • Quick navigation — click any task to load its full trajectory

After selecting a task, the waterfall timeline shows a session digest and execution spans. Each span represents a tool call the agent made.

Waterfall timeline showing session digest and tool call spans
  • Session digest — Turns, Tool Calls, Tokens In/Out/Total, and Tools Used at a glance
  • Trace header — total span count and tool call breakdown (e.g., read_file × 2, create_file × 4, bash × 1)
  • Span bars — teal/emerald bars sized proportionally to duration, labeled with tool names
  • Timeline axis — seconds scale showing relative timing of each call

Click any span bar to open the detail panel with full tool call information.

Span detail panel showing tool name, attributes, arguments, and result
  • Tool name — which tool was called (bash, create_file, read_file, etc.)
  • Status badge — Passed/Failed indicator
  • Attributes — Duration, event count, event range, call ID
  • Arguments — the exact arguments passed to the tool (JSON formatted)
  • Result — expandable section with the tool’s return value

Toggle from Timeline to Events to see a chronological list of every event in the agent session.

Events list showing assistant turns and tool start/complete events
  • Assistant turns — the agent’s reasoning text before and after tool calls
  • Tool start/complete pairs — each tool call shown as start → complete with tool name and call ID
  • Show details — expand any event to see its full content
Event expanded showing the assistant's reasoning text
  • Expanded detail — full text of assistant reasoning or tool arguments/results
  • Collapsible — click “Hide details” to collapse back

Back on the Tasks tab, click any task row to expand and see individual grader results.

Task expanded showing per-grader results with weights and scores
  • Per-grader rows — each grader shows name, type, pass/fail, score, and weight
  • Weight display — grader weights shown as ×N multiplier badges
  • Weighted score — composite W. Score computed from individual grader weights
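A composite score of this kind is typically the weight-normalized average of the grader scores. A minimal sketch of that arithmetic (the grader names, scores, and weights are made up for illustration, and waza's exact formula may differ):

```python
# Illustrative per-grader results: (name, score in [0, 1], weight).
# Names and numbers are made up; a real run's graders come from the eval YAML.
graders = [
    ("output_match", 1.0, 3),  # rendered as a ×3 weight badge
    ("style_check",  0.5, 1),  # rendered as a ×1 weight badge
]

# Weighted aggregate: sum of weight × score over the total weight.
total_weight = sum(w for _, _, w in graders)
w_score = sum(s * w for _, s, w in graders) / total_weight
print(f"W. Score: {w_score:.2f}")  # W. Score: 0.88
```

With a ×3 weight, a failing output_match grader would drag the composite down far more than a failing style_check, which is the point of weighting.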

The Compare view lets you select any two runs and see their differences side by side. Select runs from the dropdowns and the comparison appears instantly.

Compare view showing side-by-side metrics and per-task comparison
  • Run cards — each selected run shows its spec, model, and timestamp
  • Metrics comparison — Pass Rate, Tokens, Cost, and Duration are compared with delta indicators (↑ increase in red, ↓ decrease in green)
  • Pass rate bars — horizontal bar chart visually compares pass rates
  • Per-task comparison — a table shows each task’s outcome, score, and duration for both runs, with Δ Score and Δ Duration columns highlighting differences
  • Click any row to view a trajectory diff between the two runs
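The delta indicators are simple signed differences between the two runs' metrics. A minimal sketch (the metric values are made up):

```python
# Illustrative metrics from two runs being compared (values are made up).
run_a = {"pass_rate": 0.80, "cost": 0.40}
run_b = {"pass_rate": 0.90, "cost": 0.55}

# Positive deltas render as ↑ and negative as ↓, as in the Compare view.
for metric in run_a:
    delta = run_b[metric] - run_a[metric]
    arrow = "↑" if delta > 0 else "↓" if delta < 0 else "="
    print(f"{metric}: {arrow} {delta:+.2f}")
```

Note that the coloring is metric-aware: an ↑ on pass rate is good, while an ↑ on cost or duration is shown in red as a regression.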

The Trends page charts your evaluation metrics over time. Four line charts show Pass Rate, Tokens per Run, Cost per Run, and Duration per Run across all runs, ordered chronologically.

Trends page with pass rate, tokens, cost, and duration charts
  • Pass Rate — track whether your agent skills are improving or regressing
  • Tokens per Run — monitor token consumption trends to catch runaway prompts
  • Cost per Run — visualize spending patterns across evaluation runs
  • Duration per Run — spot performance regressions in execution time
  • Model filter — use the “Model” dropdown (top-right) to filter charts to a specific model (e.g., only gpt-4o runs)

The Live view connects to the waza server via WebSocket to show real-time evaluation progress. When no evaluation is running, it shows a disconnected state with instructions to start one.

Live view showing disconnected state with WebSocket status
  • WebSocket status — a red “Disconnected” badge (top-right) shows the current connection state; it turns green when an evaluation is running
  • Start prompt — when idle, the view shows a waza run command hint to start a new evaluation
  • Real-time updates — during an active run, tasks and grader results stream in as they complete
# Start a live evaluation and watch it in the dashboard
waza serve --results-dir ./results &
waza run eval.yaml --context-dir ./fixtures --live

The W. Score column in the Tasks table shows the weighted aggregate score for each task. When graders have different weights configured in the eval YAML, the weighted score reflects their relative importance.

Full task table showing weighted scores and confidence intervals
  • W. Score column — appears in both the Runs overview and Run Detail views
  • Per-grader weights — configure weights in your eval YAML under each grader’s weight field
  • Dash (—) — the column shows a dash for runs where no grader weights are configured
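A hypothetical fragment showing where such a weight field might sit in an eval YAML — the field names and nesting here are assumptions for illustration, not the authoritative waza schema, so consult the eval spec reference for the exact layout:

```yaml
# Hypothetical eval fragment -- structure is an assumption, not the
# authoritative waza schema; check the eval spec reference.
tasks:
  - id: create-readme
    graders:
      - name: output_match
        weight: 3   # shown as a ×3 badge in the dashboard
      - name: style_check
        weight: 1   # shown as a ×1 badge
```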

Tasks with weighted scores display statistical significance indicators. These help you determine whether score differences between models or runs are meaningful or just noise.

Task rows showing significance badges and confidence interval ranges
  • ✓ significant (green) — the score is statistically significant with tight confidence intervals (e.g., [82.0%, 85.0%])
  • ⚠ not significant (yellow) — the score has wide confidence intervals (e.g., [45.0%, 90.0%]), meaning more data is needed
  • Confidence intervals — shown as [lower%, upper%] ranges below the weighted score
  • Actionable insight — significant scores validate your skill’s behavior; non-significant scores mean you need more test cases or sharper grader criteria
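The [lower%, upper%] ranges behave like ordinary confidence intervals over repeated task samples: more samples with consistent scores shrink the interval. A minimal sketch using a normal-approximation 95% interval (waza's actual statistical method is not documented here, so treat this as an analogy, with made-up sample scores):

```python
import math

def confint_95(scores):
    """Normal-approximation 95% CI for the mean of per-sample scores."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    half = 1.96 * math.sqrt(var / n)
    return max(0.0, mean - half), min(1.0, mean + half)

# Many consistent samples -> tight interval (reads as "✓ significant");
# few scattered samples -> wide interval (reads as "⚠ not significant").
tight = confint_95([0.84, 0.82, 0.85, 0.83, 0.84, 0.83])
wide = confint_95([0.2, 0.9, 0.5])
print(tight, wide)
```

The tight case lands near [82%, 84%], while the scattered three-sample case spans most of the scale — exactly the situation where the dashboard tells you to gather more data before trusting a comparison.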