Results API

The vally serve command exposes a REST API for querying eval results. The API is designed for agents, scripts, and CI pipelines, as well as the built-in dashboard.

Start a server:

vally serve ./vally-results

Conventions

Errors return { error: { code, message } }, where code is a stable, machine-readable string such as not_found or invalid_params.
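Because code is stable, a client can branch on it instead of parsing message. A minimal sketch, assuming error envelopes arrive with a status of 400 or above (this document does not say so) and that the body is already parsed JSON. The names ApiError and check_response are hypothetical and not part of vally.

```python
from typing import Any


class ApiError(Exception):
    """Carries the documented error envelope fields."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def check_response(status: int, body: Any) -> Any:
    """Return body for successful responses, raise ApiError otherwise.

    Assumption: errors use HTTP status >= 400. Checking the status first
    avoids confusing the envelope with the `error` field that outcome
    detail responses may carry.
    """
    if status < 400:
        return body
    error = body.get("error") if isinstance(body, dict) else None
    if isinstance(error, dict):
        raise ApiError(str(error.get("code", "unknown")), str(error.get("message", "")))
    # No envelope found: fall back to a generic code (hypothetical value).
    raise ApiError("unknown", f"HTTP {status}")
```

A caller can then write `except ApiError as exc: if exc.code == "not_found": ...` and stay robust to wording changes in message.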

Paginated responses include { page: { nextCursor, limit, hasMore, total } }. To fetch the next page, pass ?cursor=<nextCursor>&limit=100. The maximum limit is 1000.
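The cursor loop can be sketched as a generator. This assumes the first request omits cursor and that hasMore is false on the last page. fetch_page is a hypothetical caller-supplied function that performs the GET with the given cursor and limit and returns the parsed JSON body.

```python
from typing import Any, Callable, Iterator, Optional

MAX_LIMIT = 1000  # documented maximum page size

# Hypothetical signature: (cursor or None, limit) -> parsed response body
FetchPage = Callable[[Optional[str], int], dict]


def iter_items(fetch_page: FetchPage, limit: int = 100) -> Iterator[Any]:
    """Yield every item from a paginated endpoint by following nextCursor."""
    limit = min(limit, MAX_LIMIT)  # never ask for more than the API allows
    cursor: Optional[str] = None   # assumption: no cursor on the first request
    while True:
        body = fetch_page(cursor, limit)
        yield from body["items"]
        page = body["page"]
        if not page["hasMore"]:
            return
        cursor = page["nextCursor"]
```

Keeping the HTTP call behind fetch_page lets the same loop serve GET /api/runs and GET /api/outcomes, and makes it easy to test without a running server.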

Outcome status is a constrained enum: success, error, or skipped.

Metric definitions are available at GET /api/metrics. Each metric includes its id, unit (ratio, ms, count, tokens), direction (higher_is_better, lower_is_better, or neutral), and aggregation type (rate, mean, sum, p50, p95). Agents can use these to interpret values without external documentation.
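As an illustration of how an agent might use direction, the sketch below labels a change between two values of the same metric. The function name and the returned labels are hypothetical. Only the three direction strings come from this document.

```python
def classify_change(direction: str, old: float, new: float) -> str:
    """Label a metric change using the metric's documented direction."""
    if new == old:
        return "unchanged"
    if direction == "neutral":
        return "neutral"  # the value moved, but neither way is better
    if direction == "higher_is_better":
        return "improved" if new > old else "regressed"
    if direction == "lower_is_better":
        return "improved" if new < old else "regressed"
    raise ValueError(f"unknown direction: {direction!r}")
```

For example, a rise in a lower_is_better metric with unit ms is classified as regressed, with no metric-specific logic in the caller.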

Endpoints

Runs

`GET /api/runs`

List all runs. Returns { items: RunSummary[], page }.

Each RunSummary includes id, source, models, stimulusCount, outcomeCount, and passRate.

`GET /api/runs/:id`

Run detail. Adds stimuli (list of stimulus names) and config (eval configuration) to the run summary.

`GET /api/runs/:id/outcomes`

Outcomes scoped to this run. Same response shape as GET /api/outcomes with an implicit runId filter.

`GET /api/runs/:id/matrix?metric=pass_rate`

Score matrix for a run. Returns { rows, columns, cells, metric } where rows are stimulus names, columns are model names, and each cell has a value, outcomeCount, and optional flags.

The metric query parameter selects which metric to compute. Defaults to pass_rate. Use GET /api/metrics to discover available metrics.

`GET /api/runs/:id/ranking`

Model leaderboard. Returns { models: ModelRank[] } sorted by pass rate (descending), then duration (ascending). Each entry includes rank, passRate, meanDurationMs, meanTurns, meanToolCalls, token totals, and outcomeCount.
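The documented ordering can be reproduced client-side, for example to re-rank a filtered subset of models. A sketch, assuming ranks are consecutive integers starting at 1 (tie handling is not specified here). rank_models is a hypothetical helper.

```python
def rank_models(entries: list[dict]) -> list[dict]:
    """Order entries by passRate (descending), then meanDurationMs (ascending)."""
    ordered = sorted(entries, key=lambda e: (-e["passRate"], e["meanDurationMs"]))
    # Assumption: rank is the 1-based position in the sorted list.
    return [{**entry, "rank": position} for position, entry in enumerate(ordered, start=1)]
```

Negating passRate in the sort key gives a descending primary order while keeping duration ascending as the tie-breaker.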

`GET /api/runs/:id/graders`

Grader failure analysis. Returns { graders: GraderStats[] } sorted by pass rate (ascending, worst first). Each entry includes passCount, failCount, passRate, meanScore, and up to three sampleFailureEvidence strings from actual failing outcomes.

`GET /api/runs/:id/compare?modelA=...&modelB=...`

Within-run model comparison. Returns per-stimulus pass/fail and scores for two models, plus a summary with agreementRate and divergentCount.

`GET /api/runs/:id/tools`

Tool usage analysis. Returns { tools: ToolStats[] } sorted by call count. Each entry includes callCount, successRate, outcomePassRate (pass rate of outcomes that used this tool), and models (which models used it).
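A client might shortlist unreliable tools from this response before looking at individual outcomes. A minimal sketch: the 0.9 threshold is an arbitrary illustration, and suspect_tools is a hypothetical helper operating on the parsed tools array.

```python
def suspect_tools(tools: list[dict], max_success_rate: float = 0.9) -> list[dict]:
    """Return tools whose successRate is below the threshold, worst first."""
    flagged = [tool for tool in tools if tool["successRate"] < max_success_rate]
    return sorted(flagged, key=lambda tool: tool["successRate"])
```

Comparing each flagged entry's outcomePassRate against the run's overall pass rate hints at whether the tool's failures actually hurt outcomes.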

Outcomes

`GET /api/outcomes`

List outcomes with filters. Supports query parameters:

| Parameter | Description |
|---|---|
| `runId` | Filter by run |
| `model` | Filter by model name |
| `stimulus` | Filter by stimulus name |
| `status` | Filter by status (success/error/skipped) |
| `passed` | Filter by pass/fail (true/false) |
| `grader` | Filter by grader name |
| `limit` | Page size (max 1000) |
| `cursor` | Pagination cursor |

Returns { items: OutcomeSummary[], page }. Each summary includes the outcome id, stimulusName, model, trialIndex, status, passed, score, and all metric values (duration, turns, tool calls, errors, tokens). It also includes graderSummary, an array of { name, passed, score } for each grader.
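Building the query string by hand is error-prone once values need escaping. The sketch below assembles a GET /api/outcomes URL from the documented filters. The base URL is supplied by the caller because the server's host and port are not stated here, and outcomes_url is a hypothetical helper.

```python
from urllib.parse import urlencode

# Filter names documented for GET /api/outcomes
ALLOWED_FILTERS = {
    "runId", "model", "stimulus", "status", "passed", "grader", "limit", "cursor",
}


def outcomes_url(base_url: str, **filters: object) -> str:
    """Build an outcomes URL, skipping filters that are None."""
    params = {}
    for name, value in filters.items():
        if name not in ALLOWED_FILTERS:
            raise ValueError(f"unsupported filter: {name}")
        if value is None:
            continue
        if isinstance(value, bool):
            value = "true" if value else "false"  # the API expects true/false
        params[name] = value
    query = urlencode(params)
    url = f"{base_url.rstrip('/')}/api/outcomes"
    return f"{url}?{query}" if query else url
```

For instance, `outcomes_url(base, model="model-a", passed=False, limit=100)` yields a URL ending in `/api/outcomes?model=model-a&passed=false&limit=100`.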

`GET /api/outcomes/:id`

Full outcome detail. Extends OutcomeSummary with graderResults (full GraderResult objects with evidence strings) and error (if the outcome errored).

`GET /api/outcomes/:id/trajectory`

Raw trajectory for an outcome. Returns the full Trajectory object with all events (tool calls, tool results, messages, errors, token usage).

`GET /api/outcomes/:id/tools`

Tool call events for an outcome. Returns { items: ToolEvent[], total } with per-call details: toolName, success, argsSizeBytes, resultSizeBytes.

Cross-run comparison

`GET /api/compare?runs=id1,id2,id3`

Compare outcomes across multiple runs. The comparison unit is a subject: a (run, model) pair. A single multi-model run produces multiple subjects.

Returns:

- subjects with subjectId, runId, model, label, and outcomeCount
- coverage showing how many stimuli overlap between subjects (intersectionCount, unionCount, and per-subject matched/total)
- rows, one per stimulus, each containing a cells map from subjectId to a cell with status (ok/failed/error/missing), sampleSize, and aggregated metrics (passRate, meanDurationMs, meanTurns, meanToolCalls, errorCount, token averages)

Agent usage patterns

The API is structured as a funnel. Start broad, then narrow to the problem:

GET /api/runs                            → pick the latest run
GET /api/runs/:id/ranking                → identify the worst model
GET /api/outcomes?model=...&passed=false  → find its failures
GET /api/outcomes/:id                    → read the grader evidence
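The funnel above could be scripted roughly as follows. Several details are assumptions rather than documented behavior: that the first item of GET /api/runs is the latest run, that each ranking entry names its model in a model field, and that the first page of failures is enough. get is a hypothetical caller-supplied function that takes a path and returns parsed JSON.

```python
from typing import Any, Callable
from urllib.parse import urlencode

Get = Callable[[str], Any]  # hypothetical: path -> parsed JSON body


def worst_model_failures(get: Get) -> list[dict]:
    """Walk the funnel: run -> worst-ranked model -> failures -> full detail."""
    runs = get("/api/runs")["items"]
    run_id = runs[0]["id"]  # assumption: the first item is the latest run

    # The ranking is sorted by pass rate descending, so the last entry is worst.
    ranking = get(f"/api/runs/{run_id}/ranking")["models"]
    worst = ranking[-1]["model"]  # assumption: the field is called "model"

    query = urlencode({"runId": run_id, "model": worst, "passed": "false"})
    failures = get(f"/api/outcomes?{query}")["items"]  # first page only

    # Outcome detail carries graderResults with evidence strings.
    return [get(f"/api/outcomes/{failure['id']}") for failure in failures]
```

Adding runId to the outcomes query keeps the failures scoped to the run that produced the ranking.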

For regression detection:

GET /api/compare?runs=<new>,<old>        → compare two runs
→ look for cells where status changed from "ok" to "failed"
→ coverage.intersectionCount tells you how comparable the runs are
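A sketch of the regression check on a parsed GET /api/compare response. The subject ids are passed in by the caller (taken from the response's subjects list). Both function names are hypothetical, and treating intersectionCount / unionCount as a comparability score is one possible reading of coverage, not a documented metric.

```python
def coverage_ratio(coverage: dict) -> float:
    """Share of stimuli present in every subject (0.0 when there are none)."""
    union = coverage["unionCount"]
    return coverage["intersectionCount"] / union if union else 0.0


def find_regressions(rows: list[dict], old_subject: str, new_subject: str) -> list[dict]:
    """Return rows whose cell went from "ok" (old subject) to "failed" (new subject)."""
    regressed = []
    for row in rows:
        old_cell = row["cells"].get(old_subject)
        new_cell = row["cells"].get(new_subject)
        if old_cell is None or new_cell is None:
            continue  # no cell for one side, so nothing to compare
        if old_cell["status"] == "ok" and new_cell["status"] == "failed":
            regressed.append(row)
    return regressed
```

A low coverage ratio suggests treating the regression list with caution, since many stimuli were not run by both subjects.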

For tool debugging:

GET /api/runs/:id/tools                  → find tools with low successRate
GET /api/outcomes?runId=...&limit=100    → list the run's outcomes
GET /api/outcomes/:id/tools              → check which outcomes called that tool