Skip to content

Trajectory Format

A trajectory is the complete record of a single agent run. Saved as JSON via --output-dir, trajectories can be re-graded, compared, and analyzed.

interface Trajectory {
id: string; // Unique run identifier
stimulus: Stimulus; // The prompt + config that produced this
events: TrajectoryEvent[]; // Flat array of typed events
metrics: TrajectoryMetrics; // Computed aggregates
output: string; // Final agent output text
workDir: string; // Workspace path during run
metadata: TrajectoryMetadata; // Execution context
workspaceStatus?: WorkspaceStatus; // How the workspace was produced
}

Indicates how the trajectory’s workDir was produced. Used by the pipeline to determine whether file-based graders can run.

ValueMeaning
"local"Executor wrote files directly to the local workDir (default when absent)
"materialized"Remote executor synced files back via finalizeWorkspace()
"remote"Files only exist on the remote executor; file-based graders will error

Events are a discriminated union on the type field. Every event has a timestamp.

Agent invoked a tool.

{
"type": "tool_call",
"timestamp": "2025-01-15T10:30:00.000Z",
"data": {
"toolName": "write_file",
"toolCallId": "call_abc123",
"arguments": { "path": "add.test.js", "content": "..." }
}
}

Tool returned a result.

{
"type": "tool_result",
"timestamp": "2025-01-15T10:30:01.000Z",
"data": {
"toolName": "write_file",
"toolCallId": "call_abc123",
"success": true,
"result": "File written successfully"
}
}

LLM token counts from a single API call.

{
"type": "token_usage",
"timestamp": "2025-01-15T10:30:00.500Z",
"data": {
"inputTokens": 1500,
"outputTokens": 350,
"model": "gpt-5.5",
"cacheReadTokens": 200,
"cacheWriteTokens": 0
}
}

Conversation turn boundaries.

{ "type": "turn_start", "timestamp": "...", "data": { "turnId": "turn-1" } }
{ "type": "turn_end", "timestamp": "...", "data": { "turnId": "turn-1" } }

Text messages in the conversation.

{ "type": "assistant_message", "timestamp": "...", "data": { "content": "I'll write tests for..." } }
{ "type": "user_message", "timestamp": "...", "data": { "content": "Write tests for add()" } }

A skill was loaded by the agent.

{
"type": "skill_activation",
"timestamp": "...",
"data": {
"name": "test-writer",
"path": "/skills/test-writer/SKILL.md",
"pluginName": "copilot",
"allowedTools": ["write_file", "read_file"]
}
}

Something went wrong during the run.

{
"type": "error",
"timestamp": "...",
"data": {
"message": "Request timed out",
"type": "TimeoutError",
"code": 408
}
}

Computed from events after the run completes:

interface TrajectoryMetrics {
tokenUsage: {
inputTokens: number;
outputTokens: number;
totalTokens: number;
cacheReadTokens: number;
cacheWriteTokens: number;
callCount: number;
byModel: Record<
string,
{
inputTokens: number;
outputTokens: number;
callCount: number;
}
>;
};
toolCallCount: number;
toolCallBreakdown: Record<string, number>; // { "write_file": 3, "read_file": 1 }
skillActivationCount: number;
skillActivationBreakdown: Record<string, number>; // { "test-writer": 1 }
turnCount: number;
wallTimeMs: number;
errorCount: number;
}
interface TrajectoryMetadata {
model: string; // Model used for execution
skillsLoaded: string[]; // Names of skills that were loaded
startedAt: Date;
completedAt: Date;
executor: string; // Which executor ran this
sessionID: string;
}
Terminal window
# Save trajectories (one trial-result record per trial, in results.jsonl)
vally eval --eval-spec eval.yaml --output-dir ./results
# Re-grade trajectories from a previous run
cat ./results/*/results.jsonl | vally grade --eval-spec eval.yaml
# Inspect trial trajectories with jq. `results.jsonl` interleaves
# `trial-result` records (with a nested `trajectory` field) with a final
# `run-summary` line, so filter on `.type == "trial-result"` first.
cat ./results/*/results.jsonl | jq 'select(.type == "trial-result") | .trajectory.metrics'
cat ./results/*/results.jsonl | jq 'select(.type == "trial-result") | .trajectory.events[] | select(.type == "tool_call") | .data.toolName'
cat ./results/*/results.jsonl | jq 'select(.type == "trial-result") | .trajectory.events | length'