Results API

The vally serve command exposes a REST API for querying eval results. The API is designed to be consumed by agents, scripts, and CI pipelines as well as the built-in dashboard.

Start a server:

vally serve ./vally-results

The API is then available on port 3200 under /api/.

Errors return { error: { code, message } } where code is a stable machine-readable string like not_found or invalid_params.
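
A minimal sketch of how a client might turn the error envelope above into an exception. The `ApiError` class, the `raise_for_error` helper, and the sample body are hypothetical and not part of vally; only the `{ error: { code, message } }` shape comes from this page.

```python
import json


class ApiError(Exception):
    """Raised when a response body carries the documented error envelope."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def raise_for_error(body: dict) -> dict:
    """Return the body unchanged, or raise ApiError if it has an `error` object."""
    err = body.get("error")
    if isinstance(err, dict):
        raise ApiError(err.get("code", "unknown"), err.get("message", ""))
    return body


# Hypothetical response body, shaped like the envelope described above.
sample = json.loads('{"error": {"code": "not_found", "message": "run not found"}}')
try:
    raise_for_error(sample)
except ApiError as exc:
    caught_code = exc.code  # branch on the stable code, not on the message text
```

Branching on `code` rather than `message` keeps a script working if the human-readable wording changes.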

Pagination uses { page: { nextCursor, limit, hasMore, total } }. Pass ?cursor=<nextCursor>&limit=100 to page through results. Maximum limit is 1000.
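
The cursor loop can be sketched as below. `iter_items` and `fake_fetch` are hypothetical helpers: the fetch function is injected so the loop can be exercised without a running server, and the fake treats the cursor as a stringified offset purely for illustration (a real cursor should be treated as opaque).

```python
from typing import Callable, Iterator, Optional


def iter_items(
    fetch_page: Callable[[Optional[str], int], dict], limit: int = 100
) -> Iterator[dict]:
    """Yield every item, following page.nextCursor until page.hasMore is false."""
    if not 1 <= limit <= 1000:  # 1000 is the documented maximum
        raise ValueError("limit must be between 1 and 1000")
    cursor: Optional[str] = None
    while True:
        body = fetch_page(cursor, limit)
        yield from body["items"]
        page = body["page"]
        if not page["hasMore"]:
            return
        cursor = page["nextCursor"]


def fake_fetch(cursor: Optional[str], limit: int) -> dict:
    """In-memory stand-in for a paginated endpoint (illustration only)."""
    data = [{"id": f"o{i}"} for i in range(5)]
    start = int(cursor) if cursor else 0
    end = start + limit
    has_more = end < len(data)
    return {
        "items": data[start:end],
        "page": {
            "nextCursor": str(end) if has_more else None,
            "limit": limit,
            "hasMore": has_more,
            "total": len(data),
        },
    }


ids = [item["id"] for item in iter_items(fake_fetch, limit=2)]
```

In real use, `fetch_page` would issue the HTTP request with `?cursor=<nextCursor>&limit=<limit>` and return the decoded JSON body.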

Outcome status is a constrained enum: success, error, or skipped.

Metric definitions are available at GET /api/metrics. Each metric includes its id, unit (ratio, ms, count, tokens), direction (higher_is_better, lower_is_better, or neutral), and aggregation type (rate, mean, sum, p50, p95). Agents can use these to interpret values without external documentation.
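
As a rough illustration of how an agent might use `direction`, the sketch below classifies a change between two values of a metric. The `classify_change` helper and both sample definitions are hypothetical; in particular the `duration_ms` id and the exact field values are assumptions, not output copied from the API.

```python
def classify_change(metric: dict, old: float, new: float) -> str:
    """Label a value change using the metric's documented `direction` field."""
    if new == old:
        return "unchanged"
    direction = metric["direction"]
    if direction == "neutral":
        return "changed"  # no notion of better or worse for this metric
    if direction not in ("higher_is_better", "lower_is_better"):
        raise ValueError(f"unknown direction: {direction!r}")
    better = new > old if direction == "higher_is_better" else new < old
    return "improved" if better else "regressed"


# Hypothetical definitions shaped like entries from GET /api/metrics.
pass_rate = {"id": "pass_rate", "unit": "ratio",
             "direction": "higher_is_better", "aggregation": "rate"}
duration = {"id": "duration_ms", "unit": "ms",
            "direction": "lower_is_better", "aggregation": "mean"}

pass_rate_change = classify_change(pass_rate, 0.8, 0.9)
duration_change = classify_change(duration, 1200, 1500)
```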

GET /api/runs

List all runs. Returns { items: RunSummary[], page }.

Each RunSummary includes id, source, models, stimulusCount, outcomeCount, and passRate.

Run detail. Adds stimuli (list of stimulus names) and config (eval configuration) to the run summary.

Outcomes scoped to this run. Same response shape as GET /api/outcomes with an implicit runId filter.

Score matrix for a run. Returns { rows, columns, cells, metric } where rows are stimulus names, columns are model names, and each cell has a value, outcomeCount, and optional flags.

The metric query parameter selects which metric to compute. Defaults to pass_rate. Use GET /api/metrics to discover available metrics.

GET /api/runs/:id/ranking

Model leaderboard. Returns { models: ModelRank[] } sorted by pass rate (descending), then duration (ascending). Each entry includes rank, passRate, meanDurationMs, meanTurns, meanToolCalls, token totals, and outcomeCount.
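
The sort order described above can be reproduced client-side with a two-part key. This is only a sketch of the ordering rule, not the server's implementation; the `model` field name, the sample values, and ranks starting at 1 are assumptions.

```python
def rank_models(models: list[dict]) -> list[dict]:
    """Order by passRate descending, then meanDurationMs ascending, and number the rows."""
    ordered = sorted(models, key=lambda m: (-m["passRate"], m["meanDurationMs"]))
    return [{**m, "rank": i} for i, m in enumerate(ordered, start=1)]


# Hypothetical entries; only passRate and meanDurationMs are used for ordering.
ranked = rank_models([
    {"model": "model-a", "passRate": 0.9, "meanDurationMs": 1500},
    {"model": "model-b", "passRate": 0.9, "meanDurationMs": 1200},
    {"model": "model-c", "passRate": 0.7, "meanDurationMs": 800},
])
order = [m["model"] for m in ranked]
```

Here `model-b` comes before `model-a` because the two tie on pass rate and `model-b` is faster; `model-c` is last despite being the fastest.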

Grader failure analysis. Returns { graders: GraderStats[] } sorted by pass rate (ascending, worst first). Each entry includes passCount, failCount, passRate, meanScore, and up to three sampleFailureEvidence strings from actual failing outcomes.

GET /api/runs/:id/compare?modelA=...&modelB=...

Within-run model comparison. Returns per-stimulus pass/fail and scores for two models, plus a summary with agreementRate and divergentCount.
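
One plausible reading of the summary fields, sketched under assumptions: a stimulus counts as divergent when the two models disagree on pass/fail, and `agreementRate` is the share of stimuli where they agree. The server's exact definition may differ, and the `passedA`/`passedB` field names are hypothetical.

```python
def summarize_agreement(rows: list[dict]) -> dict:
    """Compute agreement between two models over per-stimulus pass/fail rows."""
    total = len(rows)
    divergent = sum(1 for r in rows if r["passedA"] != r["passedB"])
    rate = (total - divergent) / total if total else 0.0
    return {"agreementRate": rate, "divergentCount": divergent}


# Hypothetical per-stimulus results for two models.
summary = summarize_agreement([
    {"stimulus": "s1", "passedA": True, "passedB": True},
    {"stimulus": "s2", "passedA": True, "passedB": False},
    {"stimulus": "s3", "passedA": False, "passedB": False},
    {"stimulus": "s4", "passedA": True, "passedB": True},
])
```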

GET /api/runs/:id/tools

Tool usage analysis. Returns { tools: ToolStats[] } sorted by call count. Each entry includes callCount, successRate, outcomePassRate (pass rate of outcomes that used this tool), and models (which models used it).

GET /api/outcomes

List outcomes with filters. Supports query parameters:

Parameter   Description
runId       Filter by run
model       Filter by model name
stimulus    Filter by stimulus name
status      Filter by status (success/error/skipped)
passed      Filter by pass/fail (true/false)
grader      Filter by grader name
limit       Page size (max 1000)
cursor      Pagination cursor

Returns { items: OutcomeSummary[], page }. Each summary includes the outcome id, stimulusName, model, trialIndex, status, passed, score, and all metric values (duration, turns, tool calls, errors, tokens). It also includes graderSummary, an array of { name, passed, score } for each grader.
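
A small sketch of building a filtered outcomes URL from the parameters listed above. The `outcomes_url` helper is hypothetical, and the `http://localhost:3200/api` base is an assumption about where the server is reachable.

```python
from urllib.parse import urlencode

# Query parameters documented for the outcomes listing.
ALLOWED = ("runId", "model", "stimulus", "status", "passed", "grader", "limit", "cursor")


def outcomes_url(base: str, **filters) -> str:
    """Build an outcomes URL, rejecting parameters the endpoint does not document."""
    unknown = set(filters) - set(ALLOWED)
    if unknown:
        raise ValueError(f"unsupported filters: {sorted(unknown)}")
    params = {}
    for key in ALLOWED:  # fixed order keeps the URL deterministic
        value = filters.get(key)
        if value is None:
            continue
        if isinstance(value, bool):
            value = "true" if value else "false"  # the API expects true/false
        params[key] = value
    query = urlencode(params)
    return f"{base.rstrip('/')}/outcomes" + (f"?{query}" if query else "")


# Assumed base URL; adjust to wherever `vally serve` is listening.
url = outcomes_url("http://localhost:3200/api", model="model-a", passed=False, limit=100)
```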

GET /api/outcomes/:id

Full outcome detail. Extends OutcomeSummary with graderResults (full GraderResult objects with evidence strings) and error (if the outcome errored).

Raw trajectory for an outcome. Returns the full Trajectory object with all events (tool calls, tool results, messages, errors, token usage).

Tool call events for an outcome. Returns { items: ToolEvent[], total } with per-call details: toolName, success, argsSizeBytes, resultSizeBytes.

GET /api/compare?runs=...

Compare outcomes across multiple runs. The comparison unit is a subject: a (run, model) pair. A single multi-model run produces multiple subjects.

Returns:

  • subjects with subjectId, runId, model, label, and outcomeCount
  • coverage showing how many stimuli overlap between subjects (intersectionCount, unionCount, and per-subject matched/total)
  • rows, one per stimulus, each containing a cells map from subjectId to a cell with status (ok/failed/error/missing), sampleSize, and aggregated metrics (passRate, meanDurationMs, meanTurns, meanToolCalls, errorCount, token averages)

GET /api/metrics

Returns { metrics: MetricDefinition[] }. Each definition has id, label, unit, direction, aggregation, and description.

Returns { status: "ok" }.

The API is structured as a funnel. Start broad, narrow to the problem:

GET /api/runs → pick the latest run
GET /api/runs/:id/ranking → identify the worst model
GET /api/outcomes?model=...&passed=false → find its failures
GET /api/outcomes/:id → read the grader evidence

For regression detection:

GET /api/compare?runs=<new>,<old> → compare two runs
→ look for cells where status changed from "ok" to "failed"
→ coverage.intersectionCount tells you how comparable the runs are
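
The regression check above could look roughly like this once the compare response is decoded. The sketch pairs subjects by model name across the two runs and reports cells that went from "ok" to "failed". The `find_regressions` helper, the `stimulus` key on each row, and the sample payload are assumptions made for illustration.

```python
def find_regressions(comparison: dict, old_run: str, new_run: str) -> list[tuple[str, str]]:
    """Return (stimulus, model) pairs whose cell status went from ok to failed."""
    by_run_model = {
        (s["runId"], s["model"]): s["subjectId"] for s in comparison["subjects"]
    }
    regressions = []
    for row in comparison["rows"]:
        for (run_id, model), new_sid in by_run_model.items():
            if run_id != new_run:
                continue
            old_sid = by_run_model.get((old_run, model))
            if old_sid is None:
                continue  # model absent from the old run, nothing to compare
            old_cell = row["cells"].get(old_sid)
            new_cell = row["cells"].get(new_sid)
            if (old_cell and new_cell
                    and old_cell["status"] == "ok"
                    and new_cell["status"] == "failed"):
                regressions.append((row["stimulus"], model))
    return regressions


# Hypothetical, trimmed-down compare response.
sample = {
    "subjects": [
        {"subjectId": "s-old", "runId": "run-1", "model": "model-a"},
        {"subjectId": "s-new", "runId": "run-2", "model": "model-a"},
    ],
    "rows": [
        {"stimulus": "login-flow",
         "cells": {"s-old": {"status": "ok"}, "s-new": {"status": "failed"}}},
        {"stimulus": "search",
         "cells": {"s-old": {"status": "ok"}, "s-new": {"status": "ok"}}},
    ],
}
regressions = find_regressions(sample, old_run="run-1", new_run="run-2")
```

Cells with status "missing" or "error" are deliberately not counted as regressions here; a stricter check might flag them separately.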

For tool debugging:

GET /api/runs/:id/tools → find tools with low successRate
GET /api/outcomes?runId=...&limit=100 → check outcomes that used that tool
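
The first step of the tool-debugging flow can be sketched as a filter over the `tools` array. The `flaky_tools` helper, the 0.9 threshold, the `toolName` key on each entry, and the sample data are assumptions, not part of the documented response.

```python
def flaky_tools(tools: list[dict], threshold: float = 0.9) -> list[dict]:
    """Return tools whose successRate is below the threshold, worst first."""
    low = [t for t in tools if t["successRate"] < threshold]
    return sorted(low, key=lambda t: t["successRate"])


# Hypothetical entries shaped like ToolStats.
suspects = flaky_tools([
    {"toolName": "read_file", "callCount": 120, "successRate": 0.99},
    {"toolName": "run_tests", "callCount": 40, "successRate": 0.55},
    {"toolName": "search", "callCount": 75, "successRate": 0.82},
])
suspect_names = [t["toolName"] for t in suspects]
```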