How It Works

Vally follows one pipeline model from local development to production evaluation:

Stimulus  →  Executor  →  Trajectory  →  Graders  →  Score

Each component has a clear responsibility, explained in the sections below.

The pipeline

Eval Discovery

Before running evals, the framework discovers eval files by scanning configured directories. By default, it finds files named eval.yaml or eval.yml in the evals/ directory.

You can customize this via .vally.yaml:

paths:
  evals: [evals/, tests/evals/] # scan multiple directories
  evalFilenames: ["eval.yaml", "*.eval.yaml"] # custom filename patterns

Suites can further scope which eval files are included using glob patterns in the evals field. See Authoring Eval Suites for details.
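To make the filename matching concrete, here is a minimal sketch of how patterns like those in evalFilenames could be tested against a path. The helpers patternToRegExp and isEvalFile are hypothetical names, not part of Vally, and the sketch assumes that `*` is the only wildcard; the real discovery logic may support richer globs.

```typescript
// Hypothetical helpers for illustration only; not Vally's API.
// Converts a filename pattern into a RegExp, treating `*` as the only wildcard.
function patternToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(`^${escaped}$`);
}

// Returns true when the file's base name matches at least one pattern.
function isEvalFile(filePath: string, patterns: string[]): boolean {
  const baseName = filePath.split("/").pop() ?? filePath;
  return patterns.some((p) => patternToRegExp(p).test(baseName));
}

const evalPatterns = ["eval.yaml", "*.eval.yaml"];
const discovered = [
  "evals/eval.yaml",
  "tests/evals/login.eval.yaml",
  "evals/notes.md",
].filter((f) => isEvalFile(f, evalPatterns));

console.log(discovered); // the two YAML paths; notes.md is skipped
```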

1. Stimulus

A Stimulus is a prompt — what you ask the agent to do — plus configuration for how to grade the result.

stimuli:
  - name: basic-test-generation
    prompt: |
      Write unit tests for this function:
      function add(a, b) { return a + b; }
    graders:
      - type: file-exists
        config:
          path: "add.test.js"
      - type: output-contains
        config:
          substring: "test"
    constraints:
      max_turns: 10
      max_tokens: 5000

Think of stimuli as your test cases. Each one defines:

prompt — what to ask the agent
graders — how to check the result (your assertions)
constraints (optional) — resource limits

Stimuli can also define an environment: skills to load, files to stage, and commands to run before execution.
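As a rough illustration of how the YAML above maps to a typed object, the sketch below mirrors the fields shown in the example and runs a few sanity checks. StimulusSpec, GraderSpec, and validateStimulus are hypothetical names chosen for this sketch; Vally's actual types and validation rules may differ.

```typescript
// Hypothetical types mirroring the YAML example; Vally's real definitions may differ.
interface GraderSpec {
  type: string;
  config?: Record<string, unknown>;
}

interface StimulusSpec {
  name: string;
  prompt: string;
  graders: GraderSpec[];
  constraints?: { max_turns?: number; max_tokens?: number };
}

// Returns a list of problems; an empty list means the spec looks well-formed.
function validateStimulus(spec: StimulusSpec): string[] {
  const problems: string[] = [];
  if (spec.name.trim() === "") problems.push("name is empty");
  if (spec.prompt.trim() === "") problems.push("prompt is empty");
  if (spec.graders.length === 0) problems.push("no graders defined");
  const maxTurns = spec.constraints?.max_turns;
  const maxTokens = spec.constraints?.max_tokens;
  if (maxTurns !== undefined && maxTurns <= 0) problems.push("max_turns must be positive");
  if (maxTokens !== undefined && maxTokens <= 0) problems.push("max_tokens must be positive");
  return problems;
}

const basicTestGeneration: StimulusSpec = {
  name: "basic-test-generation",
  prompt: "Write unit tests for this function:\nfunction add(a, b) { return a + b; }\n",
  graders: [
    { type: "file-exists", config: { path: "add.test.js" } },
    { type: "output-contains", config: { substring: "test" } },
  ],
  constraints: { max_turns: 10, max_tokens: 5000 },
};

console.log(validateStimulus(basicTestGeneration)); // []
```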

2. Executor

An Executor takes a stimulus and runs an agent, producing a trajectory. The executor is responsible for:

Setting up the workspace
Loading skills into the agent
Sending the prompt
Capturing all events (tool calls, messages, token usage)
Respecting timeout and constraint limits

The built-in executor uses the Copilot SDK. Every executor, built-in or custom, implements this interface:

interface Executor {
  name: string;
  execute(stimulus: Stimulus, options: ExecutorOptions): Promise<Trajectory>;
  shutdown(): Promise<void>;
}

You can implement custom executors for other agent runtimes. The executor is a plugin point — swap it without changing your eval specs or graders.
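A custom executor might look like the toy sketch below, which runs no agent and simply echoes the prompt back. The Stimulus, ExecutorOptions, TrajectoryEvent, and Trajectory types here are simplified stand-ins so the example is self-contained; in particular, the workDir option and the timestamp and data event fields are assumptions, not Vally's documented shapes.

```typescript
// Simplified stand-ins for Vally's types, reduced to what this sketch needs.
interface Stimulus { name: string; prompt: string }
interface ExecutorOptions { workDir: string } // assumed option
interface TrajectoryEvent { type: string; timestamp: number; data?: unknown }
interface Trajectory {
  id: string;
  stimulus: Stimulus;
  events: TrajectoryEvent[];
  output: string;
  workDir: string;
}
interface Executor {
  name: string;
  execute(stimulus: Stimulus, options: ExecutorOptions): Promise<Trajectory>;
  shutdown(): Promise<void>;
}

// A toy executor: it records one conversation turn and echoes the prompt as output.
class EchoExecutor implements Executor {
  name = "echo";
  private runCount = 0;

  async execute(stimulus: Stimulus, options: ExecutorOptions): Promise<Trajectory> {
    this.runCount += 1;
    const output = `echo: ${stimulus.prompt}`;
    const events: TrajectoryEvent[] = [
      { type: "turn_start", timestamp: Date.now() },
      { type: "user_message", timestamp: Date.now(), data: stimulus.prompt },
      { type: "assistant_message", timestamp: Date.now(), data: output },
      { type: "turn_end", timestamp: Date.now() },
    ];
    return { id: `echo-${this.runCount}`, stimulus, events, output, workDir: options.workDir };
  }

  async shutdown(): Promise<void> {
    // Nothing to release here; a real executor would close sessions or processes.
  }
}

const echoExecutor = new EchoExecutor();
echoExecutor
  .execute({ name: "demo", prompt: "hello" }, { workDir: "/tmp/demo" })
  .then((trajectory) => console.log(trajectory.output)); // echo: hello
```

Because graders only see the trajectory, an executor like this can stand in for a real agent when testing grader logic.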

3. Trajectory

A Trajectory is the behavioral record of a single run. Its core is a flat array of typed events, stored alongside computed metrics and the final output:

interface Trajectory {
  id: string;
  stimulus: Stimulus;
  events: TrajectoryEvent[]; // flat event log
  metrics: TrajectoryMetrics; // computed aggregates
  output: string; // final agent output
  workDir: string;
  metadata: TrajectoryMetadata;
}

Event types capture everything the agent did:

Event	What it records
`tool_call`	Agent invoked a tool (name, arguments)
`tool_result`	Tool returned (success/failure, result)
`token_usage`	LLM token counts (input, output, cache, model)
`turn_start` / `turn_end`	Conversation turn boundaries
`assistant_message`	Agent’s text response
`user_message`	User/system message sent to agent
`skill_activation`	A skill was loaded and activated
`reasoning`	Agent’s reasoning/thinking output
`custom`	Executor-specific custom event data
`error`	Something went wrong

Metrics are computed from events:

Metric	Derivation
`tokenUsage`	Sum of all `token_usage` events, broken down by model
`toolCallCount`	Count of `tool_call` events
`skillActivationCount`	Count of `skill_activation` events
`turnCount`	Count of `turn_end` events
`wallTimeMs`	Time from first to last event
`errorCount`	Count of `error` events
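The derivations in the table can be sketched as a single pass over the event log. This is an illustration only: the event shape below (timestamp, model, inputTokens, outputTokens) is a simplified assumption, computeMetrics is a hypothetical name, and the sketch assumes events are stored in chronological order.

```typescript
// Simplified, hypothetical event shape; real events carry more fields.
interface TrajectoryEvent {
  type: string;
  timestamp: number; // assumed: milliseconds since epoch
  model?: string; // assumed: present on token_usage events
  inputTokens?: number;
  outputTokens?: number;
}

interface TokenTotals { input: number; output: number }

interface TrajectoryMetrics {
  tokenUsage: Record<string, TokenTotals>; // keyed by model
  toolCallCount: number;
  skillActivationCount: number;
  turnCount: number;
  wallTimeMs: number;
  errorCount: number;
}

function computeMetrics(events: TrajectoryEvent[]): TrajectoryMetrics {
  const count = (type: string): number => events.filter((e) => e.type === type).length;

  // Sum token_usage events per model.
  const tokenUsage: Record<string, TokenTotals> = {};
  for (const e of events) {
    if (e.type !== "token_usage") continue;
    const model = e.model ?? "unknown";
    const totals = tokenUsage[model] ?? { input: 0, output: 0 };
    totals.input += e.inputTokens ?? 0;
    totals.output += e.outputTokens ?? 0;
    tokenUsage[model] = totals;
  }

  // Wall time spans the first and last event; zero for an empty log.
  const first = events[0];
  const last = events[events.length - 1];
  return {
    tokenUsage,
    toolCallCount: count("tool_call"),
    skillActivationCount: count("skill_activation"),
    turnCount: count("turn_end"),
    wallTimeMs: first && last ? last.timestamp - first.timestamp : 0,
    errorCount: count("error"),
  };
}

const sampleEvents: TrajectoryEvent[] = [
  { type: "turn_start", timestamp: 1000 },
  { type: "tool_call", timestamp: 1100 },
  { type: "tool_result", timestamp: 1200 },
  { type: "token_usage", timestamp: 1300, model: "model-a", inputTokens: 20, outputTokens: 5 },
  { type: "token_usage", timestamp: 1400, model: "model-a", inputTokens: 10, outputTokens: 3 },
  { type: "turn_end", timestamp: 1500 },
];
const sampleMetrics = computeMetrics(sampleEvents);
console.log(sampleMetrics.toolCallCount, sampleMetrics.turnCount, sampleMetrics.wallTimeMs); // 1 1 500
```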

4. Graders

A Grader evaluates some aspect of an eval and produces a result. Most graders inspect a captured trajectory; static graders can also inspect project assets directly without running an agent. Every grader implements one interface:

interface Grader {
  metadata: GraderMetadata;
  grade(input: GraderInput): Promise<GraderResult>;
}

The GraderResult is always the same shape:

interface GraderResult {
  name: string;
  kind: "code" | "llm" | "human";
  passed: boolean;
  score: number; // normalized [0, 1]
  label?: string; // categorical: "correct", "incorrect", etc.
  evidence: string; // human-readable explanation
  details?: GraderResult[]; // sub-checks (for composite graders)
  metadata?: Record<string, unknown>; // structured data for programmatic consumption
}

Graders range from simple (check if a file exists) to complex (ask an LLM to judge output quality). The built-in prompt grader sends the full trajectory to an LLM judge and evaluates against a rubric — it can assess things like code quality, explanation clarity, and task completeness that no static check can express.

What makes graders powerful is the taxonomy metadata — see Grader Taxonomy.

Example: a simple grader

Here’s a simplified version of the built-in output-contains grader, which checks whether the agent’s final output contains a given substring.

export class OutputContainsGrader implements Grader {
  metadata: GraderMetadata;

  constructor() {
    this.metadata = {
      name: "output-contains",
      description: "Checks if output contains specific text",
      behavior: { execution: "single" },
      costProfile: "free",
      determinism: "static",
      reference: "reference-free",
      temporalScope: "trajectory-level",
    };
  }

  async grade(input: GraderInput): Promise<GraderResult> {
    // Matching is case-insensitive unless config.case_sensitive is set to true.
    const caseSensitive = input.config.case_sensitive ?? false;
    const output = caseSensitive ? input.trajectory.output : input.trajectory.output.toLowerCase();
    const substring = caseSensitive ? input.config.substring : input.config.substring.toLowerCase();
    // Pass when the (possibly lowercased) output contains the substring.
    const passed = output.includes(substring);

    return {
      name: this.metadata.name,
      kind: "code",
      passed,
      score: passed ? 1 : 0,
      evidence: passed
        ? `'${input.config.substring}' found in output`
        : `'${input.config.substring}' NOT found in output`,
      label: passed ? "correct" : "incorrect",
    };
  }
}
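The details field on GraderResult exists for composite graders. As a sketch of how sub-checks could be rolled up, the hypothetical allOf helper below passes only when every sub-check passes and averages their scores. That combination rule is an assumption made for illustration; the source does not specify how Vally's composite graders aggregate.

```typescript
// Copy of the GraderResult shape shown earlier, without the metadata field.
interface GraderResult {
  name: string;
  kind: "code" | "llm" | "human";
  passed: boolean;
  score: number; // normalized [0, 1]
  label?: string;
  evidence: string;
  details?: GraderResult[];
}

// Hypothetical combinator: requires every sub-check to pass and averages the scores.
// It reports kind "code", which assumes all sub-checks are code-based.
function allOf(name: string, checks: GraderResult[]): GraderResult {
  const passed = checks.length > 0 && checks.every((c) => c.passed);
  const score =
    checks.length > 0 ? checks.reduce((sum, c) => sum + c.score, 0) / checks.length : 0;
  const failedNames = checks.filter((c) => !c.passed).map((c) => c.name);
  return {
    name,
    kind: "code",
    passed,
    score,
    evidence: passed
      ? `all ${checks.length} sub-checks passed`
      : `failed sub-checks: ${failedNames.join(", ") || "none defined"}`,
    details: checks,
  };
}

const combined = allOf("tests-written", [
  { name: "file-exists", kind: "code", passed: true, score: 1, evidence: "add.test.js exists" },
  { name: "output-contains", kind: "code", passed: false, score: 0, evidence: "'test' NOT found in output" },
]);
console.log(combined.passed, combined.score, combined.evidence);
// false 0.5 failed sub-checks: output-contains
```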

5. Score

Grader results are aggregated into a score using configurable weights and a pass threshold:

scoring:
  weights:
    file-exists: 0.7
    output-contains: 0.3
  threshold: 0.7
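One plausible reading of this configuration is a weighted mean compared against the threshold. The sketch below makes that assumption explicit: the aggregate function is a hypothetical name, normalizing by the total weight and giving unlisted graders a weight of 1 are both choices made for this example, and the Scoring page remains the authority on the actual formula.

```typescript
interface ScoredResult { name: string; score: number }
interface ScoringConfig { weights: Record<string, number>; threshold: number }

// Assumed rule: weighted mean of grader scores, passing when it reaches the threshold.
function aggregate(
  results: ScoredResult[],
  config: ScoringConfig,
): { score: number; passed: boolean } {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const r of results) {
    const weight = config.weights[r.name] ?? 1; // assumption: unlisted graders weigh 1
    weightedSum += weight * r.score;
    totalWeight += weight;
  }
  const score = totalWeight > 0 ? weightedSum / totalWeight : 0;
  return { score, passed: score >= config.threshold };
}

const scoringConfig: ScoringConfig = {
  weights: { "file-exists": 0.7, "output-contains": 0.3 },
  threshold: 0.7,
};

// The file check fails and the substring check passes: only 0.3 of the weight is earned.
const outcome = aggregate(
  [
    { name: "file-exists", score: 0 },
    { name: "output-contains", score: 1 },
  ],
  scoringConfig,
);
console.log(outcome.score.toFixed(2), outcome.passed); // 0.30 false
```

Under this reading, the heavier file-exists weight means a run cannot pass on the substring check alone.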

For multiple trials (runs > 1), Vally computes multi-trial metrics; pass@k uses the unbiased estimator from Chen et al., 2021:

pass@k — probability of at least 1 success in k trials (unbiased combinatorial estimator)
pass^k — probability that ALL k trials succeed: p^k
flakiness — flags stimuli where some trials pass and others fail

Use --runs K on the CLI or set defaults.runs in your eval spec to enable multi-trial mode.
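The three metrics can be sketched from two counts: n trials run and c of them passing. The pass@k function below follows the estimator published by Chen et al., 1 - C(n - c, k) / C(n, k), evaluated as a running product. The function names are hypothetical, and estimating p as c / n for pass^k is an assumption of this sketch rather than a statement about Vally's implementation.

```typescript
// Unbiased pass@k estimator (Chen et al., 2021): 1 - C(n - c, k) / C(n, k),
// computed as a product to avoid large binomial coefficients.
function passAtK(n: number, c: number, k: number): number {
  if (k < 1 || k > n) throw new RangeError("k must be between 1 and n");
  if (n - c < k) return 1; // fewer than k failures: any k trials include a success
  let probAllFail = 1;
  for (let i = n - c + 1; i <= n; i++) {
    probAllFail *= 1 - k / i;
  }
  return 1 - probAllFail;
}

// pass^k: probability that all k trials succeed, with p assumed to be c / n.
function passHatK(n: number, c: number, k: number): number {
  return Math.pow(c / n, k);
}

// Flaky: some trials passed and some failed.
function isFlaky(n: number, c: number): boolean {
  return c > 0 && c < n;
}

// Example: 4 trials, 1 passed.
console.log(passAtK(4, 1, 2)); // 0.5
console.log(passHatK(4, 1, 2)); // 0.0625
console.log(isFlaky(4, 1)); // true
```

The gap between the two numbers is the point: pass@k rewards an agent that succeeds at least occasionally, while pass^k rewards one that succeeds every time.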

See Scoring for the full math.

The full picture

┌─────────────────────────────────────────────────────────────────┐
│                         eval.yaml                               │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │ stimulus: "Write tests for add(a,b)"                     │   │
│  │ graders: [file-exists, output-contains]                  │   │
│  │ config: { runs: 3, model: gpt-5.5 }                       │   │
│  └────────┬─────────────────────────────────────────────────┘   │
│           │                                                      │
│           ▼                                                      │
│  ┌─────────────────┐    ┌────────────────────────────────────┐  │
│  │    Executor      │───▶│  Trajectory (events + metrics)    │  │
│  │  (Copilot SDK)   │    │  tool_call, token_usage, output   │  │
│  └─────────────────┘    └──────────┬─────────────────────────┘  │
│                                     │                            │
│                    ┌────────────────┼───────────────┐            │
│                    ▼                ▼               ▼            │
│              ┌──────────┐   ┌──────────────┐  ┌──────────┐     │
│              │file-exists│   │output-contains│  │  (more)  │     │
│              └─────┬────┘   └──────┬───────┘  └────┬─────┘     │
│                    │               │                │            │
│                    ▼               ▼                ▼            │
│              ┌──────────────────────────────────────────┐       │
│              │  Score: 0.85 (pass@3: 0.97)  ✔ PASSED   │       │
│              └──────────────────────────────────────────┘       │
└─────────────────────────────────────────────────────────────────┘

Next steps

Grader taxonomy — how graders declare their properties
Writing custom graders — build your own