Trajectory Format

A trajectory is the complete record of a single agent run. Passing --output-dir saves trajectories as JSON, one per trial inside results.jsonl, so they can be re-graded, compared, and analyzed later.

Top-level structure

interface Trajectory {
  id: string; // Unique run identifier
  stimulus: Stimulus; // The prompt + config that produced this
  events: TrajectoryEvent[]; // Flat array of typed events
  metrics: TrajectoryMetrics; // Computed aggregates
  output: string; // Final agent output text
  workDir: string; // Workspace path during run
  metadata: TrajectoryMetadata; // Execution context
}

Events

Events form a discriminated union on the type field: the value of type determines the shape of data. Every event also has a timestamp.
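
Switching on type lets the TypeScript compiler narrow data to the matching payload. The sketch below is illustrative only: the type names (ToolCallEv, AssistantMessageEv, RunErrorEv, SketchEvent) and the describeEvent helper are hypothetical, and the payload shapes are inferred from the JSON examples in this document rather than taken from the library's own type definitions. It covers only three of the event types.

```typescript
// Hypothetical types covering a subset of the event union.
// Shapes are inferred from the JSON examples in this document.
type ToolCallEv = {
  type: "tool_call";
  timestamp: string;
  data: { toolName: string; toolCallId: string; arguments: Record<string, unknown> };
};
type AssistantMessageEv = {
  type: "assistant_message";
  timestamp: string;
  data: { content: string };
};
type RunErrorEv = {
  type: "error";
  timestamp: string;
  data: { message: string; type: string; code: number };
};
type SketchEvent = ToolCallEv | AssistantMessageEv | RunErrorEv;

// Each case sees `event.data` narrowed to the payload for that `type`.
function describeEvent(event: SketchEvent): string {
  switch (event.type) {
    case "tool_call":
      return `tool_call ${event.data.toolName} (${event.data.toolCallId})`;
    case "assistant_message":
      return `assistant_message: ${event.data.content}`;
    case "error":
      return `error ${event.data.type}: ${event.data.message}`;
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a variant is unhandled.
      const unreachable: never = event;
      return unreachable;
    }
  }
}
```

The never assignment in the default branch means that adding a new variant to SketchEvent produces a compile error until describeEvent handles it.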

tool_call

Agent invoked a tool.

{
  "type": "tool_call",
  "timestamp": "2025-01-15T10:30:00.000Z",
  "data": {
    "toolName": "write_file",
    "toolCallId": "call_abc123",
    "arguments": { "path": "add.test.js", "content": "..." }
  }
}

When tool-call simulation intercepts a call, the event carries "simulated": true so graders and reports can distinguish it from a real invocation.

tool_result

Tool returned a result.

{
  "type": "tool_result",
  "timestamp": "2025-01-15T10:30:01.000Z",
  "data": {
    "toolName": "write_file",
    "toolCallId": "call_abc123",
    "success": true,
    "result": "File written successfully"
  }
}
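
In the examples above, the tool_call and tool_result events share the same toolCallId. Assuming that id is what links a call to its result (the examples suggest this, but it is not stated as a guarantee), a call can be paired with its outcome as in the following sketch. The names ToolCallRecord, ToolResultRecord, PairedCall, and pairToolCalls are hypothetical.

```typescript
// Hypothetical shapes, inferred from the tool_call and tool_result examples.
type ToolCallRecord = {
  type: "tool_call";
  timestamp: string;
  data: { toolName: string; toolCallId: string; arguments: Record<string, unknown> };
};
type ToolResultRecord = {
  type: "tool_result";
  timestamp: string;
  data: { toolName: string; toolCallId: string; success: boolean; result: string };
};

interface PairedCall {
  toolCallId: string;
  toolName: string;
  success: boolean | null; // null when no matching result was recorded
  latencyMs: number | null; // result timestamp minus call timestamp
}

// Assumption: a result belongs to the call with the same toolCallId.
function pairToolCalls(events: Array<ToolCallRecord | ToolResultRecord>): PairedCall[] {
  const resultsById = new Map<string, ToolResultRecord>();
  for (const event of events) {
    if (event.type === "tool_result") resultsById.set(event.data.toolCallId, event);
  }

  const pairs: PairedCall[] = [];
  for (const event of events) {
    if (event.type !== "tool_call") continue;
    const result = resultsById.get(event.data.toolCallId);
    pairs.push({
      toolCallId: event.data.toolCallId,
      toolName: event.data.toolName,
      success: result ? result.data.success : null,
      latencyMs: result ? Date.parse(result.timestamp) - Date.parse(event.timestamp) : null,
    });
  }
  return pairs;
}
```

Calls without a recorded result are kept with null fields rather than dropped, which makes unanswered calls easy to spot.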

token_usage

LLM token counts from a single API call.

{
  "type": "token_usage",
  "timestamp": "2025-01-15T10:30:00.500Z",
  "data": {
    "inputTokens": 1500,
    "outputTokens": 350,
    "model": "gpt-5.5",
    "cacheReadTokens": 200,
    "cacheWriteTokens": 0
  }
}

turn_start / turn_end

Conversation turn boundaries.

{ "type": "turn_start", "timestamp": "...", "data": { "turnId": "turn-1" } }
{ "type": "turn_end",   "timestamp": "...", "data": { "turnId": "turn-1" } }
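
Because events is a flat array, turn boundaries can be used to slice it into per-turn groups. The sketch below assumes events are stored in chronological order and that every event between a turn_start and the next turn_end belongs to that turn; neither is stated explicitly in this document. TurnScopedEvent and groupByTurn are hypothetical names.

```typescript
// Hypothetical loose event shape: only the fields this sketch reads.
type TurnScopedEvent = {
  type: string;
  timestamp: string;
  data?: Record<string, unknown>;
};

// Assumption: events are in order, and events between turn_start and
// turn_end belong to that turn. Events outside any turn are skipped.
function groupByTurn(events: TurnScopedEvent[]): Map<string, TurnScopedEvent[]> {
  const turns = new Map<string, TurnScopedEvent[]>();
  let current: TurnScopedEvent[] | null = null;

  for (const event of events) {
    if (event.type === "turn_start") {
      const rawId = event.data?.["turnId"];
      const turnId = typeof rawId === "string" ? rawId : "unknown";
      current = [];
      turns.set(turnId, current);
    } else if (event.type === "turn_end") {
      current = null;
    } else if (current !== null) {
      current.push(event);
    }
  }
  return turns;
}
```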

assistant_message / user_message

Text messages in the conversation.

{ "type": "assistant_message", "timestamp": "...", "data": { "content": "I'll write tests for..." } }
{ "type": "user_message",      "timestamp": "...", "data": { "content": "Write tests for add()" } }

skill_activation

A skill was loaded by the agent.

{
  "type": "skill_activation",
  "timestamp": "...",
  "data": {
    "name": "test-writer",
    "path": "/skills/test-writer/SKILL.md",
    "pluginName": "copilot",
    "allowedTools": ["write_file", "read_file"]
  }
}

error

Something went wrong during the run.

{
  "type": "error",
  "timestamp": "...",
  "data": {
    "message": "Request timed out",
    "type": "TimeoutError",
    "code": 408
  }
}

Metrics

Computed from events after the run completes:

interface TrajectoryMetrics {
  tokenUsage: {
    inputTokens: number;
    outputTokens: number;
    totalTokens: number;
    cacheReadTokens: number;
    cacheWriteTokens: number;
    callCount: number;
    byModel: Record<
      string,
      {
        inputTokens: number;
        outputTokens: number;
        callCount: number;
      }
    >;
  };
  toolCallCount: number;
  toolCallBreakdown: Record<string, number>; // { "write_file": 3, "read_file": 1 }
  simulatedToolCallCount: number; // tool calls that were simulated, not run for real
  skillActivationCount: number;
  skillActivationBreakdown: Record<string, number>; // { "test-writer": 1 }
  turnCount: number;
  wallTimeMs: number;
  errorCount: number;
}
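
To illustrate how aggregates like these can be derived from the event list, here is a minimal sketch that recomputes a few of them. It is not the tool's actual aggregation code: MetricEvent, PartialMetrics, and summarizeEvents are hypothetical names, and the counting rules (one toolCallCount per tool_call event, one callCount per token_usage event) are assumptions based on the field names.

```typescript
// Hypothetical loose event shape: only the fields this sketch reads.
type MetricEvent = {
  type: string;
  timestamp: string;
  data?: Record<string, unknown>;
};

// A subset of TrajectoryMetrics, flattened for brevity.
interface PartialMetrics {
  toolCallCount: number;
  toolCallBreakdown: Record<string, number>;
  errorCount: number;
  inputTokens: number;
  outputTokens: number;
  callCount: number;
}

// Assumed counting rules; the real computation may differ.
function summarizeEvents(events: MetricEvent[]): PartialMetrics {
  const metrics: PartialMetrics = {
    toolCallCount: 0,
    toolCallBreakdown: {},
    errorCount: 0,
    inputTokens: 0,
    outputTokens: 0,
    callCount: 0,
  };

  for (const event of events) {
    const data = event.data ?? {};
    if (event.type === "tool_call") {
      const name = String(data["toolName"]);
      metrics.toolCallCount += 1;
      metrics.toolCallBreakdown[name] = (metrics.toolCallBreakdown[name] ?? 0) + 1;
    } else if (event.type === "error") {
      metrics.errorCount += 1;
    } else if (event.type === "token_usage") {
      metrics.inputTokens += Number(data["inputTokens"] ?? 0);
      metrics.outputTokens += Number(data["outputTokens"] ?? 0);
      metrics.callCount += 1;
    }
  }
  return metrics;
}
```

Comparing this output against the stored metrics object is one way to sanity-check a trajectory, provided the assumed counting rules hold for your version of the tool.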

Metadata

interface TrajectoryMetadata {
  model: string; // Model used for execution
  skillsLoaded: string[]; // Names of skills that were loaded
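  startedAt: Date; // Run start time; JSON has no date type, so the saved file holds a serialized value
  completedAt: Date; // Run end time, serialized the same way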
  executor: string; // Which executor ran this
  sessionID: string;
}

Working with trajectories

# Save trajectories (one trial-result record per trial, in results.jsonl)
vally eval --eval-spec eval.yaml --output-dir ./vally-results

# Re-grade trajectories from a previous run
cat ./vally-results/*/results.jsonl | vally grade --eval-spec eval.yaml

# Inspect trial trajectories with jq. `results.jsonl` interleaves
# `trial-result` records (with a nested `trajectory` field) with a final
# `run-summary` line, so filter on `.type == "trial-result"` first.
cat ./vally-results/*/results.jsonl | jq 'select(.type == "trial-result") | .trajectory.metrics'
cat ./vally-results/*/results.jsonl | jq 'select(.type == "trial-result") | .trajectory.events[] | select(.type == "tool_call") | .data.toolName'
cat ./vally-results/*/results.jsonl | jq 'select(.type == "trial-result") | .trajectory.events | length'
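
The same filtering can be done in TypeScript when jq is not available. The sketch below mirrors the second jq command: it takes the text of a results.jsonl file, keeps only trial-result records, and collects the tool names from their tool_call events. ResultsRecord and toolNamesFromJsonl are hypothetical names, and the record shape is limited to the fields described in the comments above.

```typescript
// Hypothetical record shape: only the fields this sketch reads.
interface ResultsRecord {
  type: string;
  trajectory?: {
    events: Array<{ type: string; data?: Record<string, unknown> }>;
  };
}

// Takes the raw text of a results.jsonl file (one JSON record per line).
function toolNamesFromJsonl(jsonl: string): string[] {
  const names: string[] = [];
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines, e.g. a trailing newline
    const record = JSON.parse(line) as ResultsRecord;
    // Skip the run-summary line and anything without a trajectory.
    if (record.type !== "trial-result" || !record.trajectory) continue;
    for (const event of record.trajectory.events) {
      if (event.type === "tool_call") {
        names.push(String(event.data?.["toolName"]));
      }
    }
  }
  return names;
}
```

In Node.js, the input string could come from readFileSync(path, "utf8") in node:fs, where path points at a single results.jsonl file.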