Results API
The vally serve command exposes a REST API for querying eval results. The API is designed to be consumed by agents, scripts, and CI pipelines as well as the built-in dashboard.
Start a server:
    vally serve ./vally-results
Conventions
Errors return { error: { code, message } } where code is a stable machine-readable string like not_found or invalid_params.
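A client can turn the error envelope into an exception. The sketch below assumes error responses arrive with a non-2xx HTTP status, which this page does not state. It branches on the status rather than on the presence of an error key, because outcome detail responses also carry an error field. ApiError and parse_response are illustrative names, not part of vally.

```python
import json


class ApiError(Exception):
    """Carries the documented { error: { code, message } } envelope."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def parse_response(status: int, raw: str) -> dict:
    """Decode a JSON body; raise ApiError for non-2xx responses.

    Assumption: failures are signalled by the HTTP status code.
    """
    body = json.loads(raw)
    if 200 <= status < 300:
        return body
    err = body.get("error", {}) if isinstance(body, dict) else {}
    raise ApiError(err.get("code", "unknown"), err.get("message", ""))
```

Callers can then switch on ApiError.code (for example not_found) instead of matching message text.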
Pagination uses { page: { nextCursor, limit, hasMore, total } }. Pass ?cursor=<nextCursor>&limit=100 to page through results. Maximum limit is 1000.
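One way to walk a paginated endpoint is a generator that follows nextCursor until hasMore is false. In this sketch, fetch is a hypothetical callable that performs the GET with the given query parameters and returns the decoded JSON body; only the items and page fields come from the description above.

```python
from typing import Any, Callable, Dict, Iterator


def iter_items(
    fetch: Callable[[Dict[str, str]], Dict[str, Any]], limit: int = 100
) -> Iterator[Any]:
    """Yield every item from a cursor-paginated list endpoint."""
    if not 1 <= limit <= 1000:
        raise ValueError("limit must be between 1 and 1000")
    params = {"limit": str(limit)}
    while True:
        body = fetch(params)
        yield from body["items"]
        page = body["page"]
        if not page["hasMore"]:
            return
        # Ask for the next page using the cursor from this one.
        params = {"limit": str(limit), "cursor": page["nextCursor"]}
```

Because the generator is lazy, a caller that stops iterating early never requests the remaining pages.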
Outcome status is a constrained enum: success, error, or skipped.
Metric definitions are available at GET /api/metrics. Each metric includes its id, unit (ratio, ms, count, tokens), direction (higher_is_better, lower_is_better, or neutral), and aggregation type (rate, mean, sum, p50, p95). Agents can use these to interpret values without external documentation.
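The direction field is what lets a script decide whether a change in a metric is good news. A minimal sketch, using the three direction values listed above; is_improvement is an illustrative helper, not part of the API.

```python
from typing import Optional


def is_improvement(direction: str, old: float, new: float) -> Optional[bool]:
    """Return True if new is better than old, False if worse, None if undecidable."""
    if direction == "neutral" or new == old:
        return None
    if direction == "higher_is_better":
        return new > old
    if direction == "lower_is_better":
        return new < old
    raise ValueError(f"unknown direction: {direction!r}")
```

For example, a rise from 0.8 to 0.9 is an improvement for a higher_is_better metric and a regression for a lower_is_better one.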
Endpoints
GET /api/runs
List all runs. Returns { items: RunSummary[], page }.
Each RunSummary includes id, source, models, stimulusCount, outcomeCount, and passRate.
GET /api/runs/:id
Run detail. Adds stimuli (list of stimulus names) and config (eval configuration) to the run summary.
GET /api/runs/:id/outcomes
Outcomes scoped to this run. Same response shape as GET /api/outcomes with an implicit runId filter.
GET /api/runs/:id/matrix?metric=pass_rate
Score matrix for a run. Returns { rows, columns, cells, metric } where rows are stimulus names, columns are model names, and each cell has a value, outcomeCount, and optional flags.
The metric query parameter selects which metric to compute. Defaults to pass_rate. Use GET /api/metrics to discover available metrics.
GET /api/runs/:id/ranking
Model leaderboard. Returns { models: ModelRank[] } sorted by pass rate (descending), then duration (ascending). Each entry includes rank, passRate, meanDurationMs, meanTurns, meanToolCalls, token totals, and outcomeCount.
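The sort order can be reproduced client-side, for instance to re-rank a filtered subset of models. This sketch assumes "duration" means meanDurationMs and that ranks start at 1. The model key used to identify entries is hypothetical, and the server's tie handling is not documented.

```python
from typing import Dict, List


def rank_models(models: List[Dict]) -> List[Dict]:
    """Order by passRate descending, then meanDurationMs ascending."""
    ordered = sorted(models, key=lambda m: (-m["passRate"], m["meanDurationMs"]))
    # Assign 1-based ranks without mutating the input entries.
    return [{**m, "rank": i} for i, m in enumerate(ordered, start=1)]
```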
GET /api/runs/:id/graders
Grader failure analysis. Returns { graders: GraderStats[] } sorted by pass rate (ascending, worst first). Each entry includes passCount, failCount, passRate, meanScore, and up to three sampleFailureEvidence strings from actual failing outcomes.
GET /api/runs/:id/compare?modelA=...&modelB=...
Within-run model comparison. Returns per-stimulus pass/fail and scores for two models, plus a summary with agreementRate and divergentCount.
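The summary fields are not defined on this page. A plausible reading, sketched below as an assumption, is that divergentCount counts stimuli where the two models disagree on pass/fail and agreementRate is the share where they agree. The input shape (stimulus name mapped to a pair of booleans) is invented for illustration.

```python
from typing import Dict, Tuple


def summarize_agreement(results: Dict[str, Tuple[bool, bool]]) -> Dict[str, float]:
    """Compute agreement figures from per-stimulus (passedA, passedB) pairs.

    Assumed definitions; the server may compute these differently.
    """
    total = len(results)
    divergent = sum(1 for a, b in results.values() if a != b)
    rate = (total - divergent) / total if total else 0.0
    return {"agreementRate": rate, "divergentCount": divergent}
```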
GET /api/runs/:id/tools
Tool usage analysis. Returns { tools: ToolStats[] } sorted by call count. Each entry includes callCount, successRate, outcomePassRate (pass rate of outcomes that used this tool), and models (which models used it).
Outcomes
GET /api/outcomes
List outcomes with filters. Supports query parameters:
| Parameter | Description |
|---|---|
| runId | Filter by run |
| model | Filter by model name |
| stimulus | Filter by stimulus name |
| status | Filter by status (success/error/skipped) |
| passed | Filter by pass/fail (true/false) |
| grader | Filter by grader name |
| limit | Page size (max 1000) |
| cursor | Pagination cursor |
Returns { items: OutcomeSummary[], page }. Each summary includes the outcome id, stimulusName, model, trialIndex, status, passed, score, and all metric values (duration, turns, tool calls, errors, tokens). It also includes graderSummary, an array of { name, passed, score } for each grader.
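Building the query string from a whitelist of the parameters in the table catches typos before a request is sent. The helper below is a sketch: outcomes_url is an illustrative name, and the base URL in the usage note is a placeholder, since the address vally serve listens on is not stated here.

```python
from urllib.parse import urlencode

# Parameter names taken from the table above.
ALLOWED = ("runId", "model", "stimulus", "status", "passed", "grader", "limit", "cursor")


def outcomes_url(base: str, **filters) -> str:
    """Build a GET /api/outcomes URL, rejecting unknown filter names."""
    unknown = set(filters) - set(ALLOWED)
    if unknown:
        raise ValueError(f"unknown filters: {sorted(unknown)}")
    params = {}
    for key in ALLOWED:
        value = filters.get(key)
        if value is None:
            continue
        # Booleans are sent as lowercase true/false, as the table describes.
        params[key] = str(value).lower() if isinstance(value, bool) else str(value)
    query = urlencode(params)
    return f"{base}/api/outcomes" + (f"?{query}" if query else "")
```

Calling outcomes_url("http://localhost:8080", model="gpt-x", passed=False, limit=50) yields a URL ending in /api/outcomes?model=gpt-x&passed=false&limit=50.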
GET /api/outcomes/:id
Full outcome detail. Extends OutcomeSummary with graderResults (full GraderResult objects with evidence strings) and error (if the outcome errored).
GET /api/outcomes/:id/trajectory
Raw trajectory for an outcome. Returns the full Trajectory object with all events (tool calls, tool results, messages, errors, token usage).
GET /api/outcomes/:id/tools
Tool call events for an outcome. Returns { items: ToolEvent[], total } with per-call details: toolName, success, argsSizeBytes, resultSizeBytes.
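The per-call details are enough to see which tool failed or returned unusually large results within a single outcome. A minimal sketch that groups the items array by toolName; the output keys (calls, failures, argsBytes, resultBytes) are made up for this example.

```python
from collections import defaultdict
from typing import Dict, List


def summarize_tool_events(items: List[Dict]) -> Dict[str, Dict[str, int]]:
    """Aggregate ToolEvent entries per tool name."""
    stats: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"calls": 0, "failures": 0, "argsBytes": 0, "resultBytes": 0}
    )
    for event in items:
        entry = stats[event["toolName"]]
        entry["calls"] += 1
        if not event["success"]:
            entry["failures"] += 1
        entry["argsBytes"] += event["argsSizeBytes"]
        entry["resultBytes"] += event["resultSizeBytes"]
    return dict(stats)
```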
Cross-run comparison
GET /api/compare?runs=id1,id2,id3
Compare outcomes across multiple runs. The comparison unit is a subject: a (run, model) pair. A single multi-model run produces multiple subjects.
Returns:
- subjects with subjectId, runId, model, label, and outcomeCount
- coverage showing how many stimuli overlap between subjects (intersectionCount, unionCount, and per-subject matched/total)
- rows, one per stimulus, each containing a cells map from subjectId to a cell with status (ok/failed/error/missing), sampleSize, and aggregated metrics (passRate, meanDurationMs, meanTurns, meanToolCalls, errorCount, token averages)
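Given the rows structure, a regression check reduces to comparing two subjects' cell status per stimulus. The sketch below assumes each row names its stimulus under a stimulus key, which this page does not confirm, and treats a subject absent from cells as missing.

```python
from typing import Dict, List


def find_regressions(rows: List[Dict], old_subject: str, new_subject: str) -> List[str]:
    """Return stimuli whose cell status went from "ok" to "failed"."""
    regressed = []
    for row in rows:
        cells = row["cells"]
        old = cells.get(old_subject, {}).get("status", "missing")
        new = cells.get(new_subject, {}).get("status", "missing")
        if old == "ok" and new == "failed":
            # "stimulus" is an assumed field name for the row label.
            regressed.append(row["stimulus"])
    return regressed
```

The subject IDs to pass in should be read from the subjects array of the same response rather than constructed by hand.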
GET /api/metrics
Returns { metrics: MetricDefinition[] }. Each definition has id, label, unit, direction, aggregation, and description.
GET /api/health
Returns { status: "ok" }.
Agent usage patterns
The API is structured as a funnel. Start broad, narrow to the problem:
    GET /api/runs → pick the latest run
    GET /api/runs/:id/ranking → identify the worst model
    GET /api/outcomes?model=...&passed=false → find its failures
    GET /api/outcomes/:id → read the grader evidence
For regression detection:
    GET /api/compare?runs=<new>,<old> → compare two runs
    → look for cells where status changed from "ok" to "failed"
    → coverage.intersectionCount tells you how comparable the runs are
For tool debugging:
    GET /api/runs/:id/tools → find tools with low successRate
    GET /api/outcomes?runId=...&limit=100 → check outcomes that used that tool
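For the regression pattern, one way to turn coverage into a go/no-go signal is the ratio of intersectionCount to unionCount. This is a sketch: the 0.8 threshold is an arbitrary illustration, not a vally default, and both function names are invented.

```python
from typing import Dict


def comparability(coverage: Dict[str, int]) -> float:
    """Share of all stimuli that every compared subject has in common."""
    union = coverage["unionCount"]
    return coverage["intersectionCount"] / union if union else 0.0


def is_comparable(coverage: Dict[str, int], threshold: float = 0.8) -> bool:
    """True when enough stimuli overlap for a regression check to be meaningful."""
    return comparability(coverage) >= threshold
```

An agent might skip the regression scan, or flag its findings as low confidence, when is_comparable returns False.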