
How It Works

Vally follows one pipeline model from local development to production evaluation:

Stimulus → Executor → Trajectory → Graders → Score

Each component has a clear responsibility. This page explains them all.

Before running evals, the framework discovers eval files by scanning configured directories. By default, it finds files named eval.yaml or eval.yml in the evals/ directory.

You can customize this via .vally.yaml:

paths:
  evals: [evals/, tests/evals/] # scan multiple directories
  evalFilenames: ["eval.yaml", "*.eval.yaml"] # custom filename patterns

Suites can further scope which eval files are included using glob patterns in the evals field. See Authoring Eval Suites for details.
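The filename matching described above can be sketched roughly as follows. This is an illustration only, not Vally's actual discovery code: `globToRegExp` and `isEvalFile` are hypothetical helpers, and the glob support is deliberately limited to `*`.

```typescript
// Hypothetical helpers: a sketch of filename matching, not Vally's real discovery code.

// Convert a simple glob (only `*` is supported here) into an anchored RegExp.
function globToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join("[^/]*");
  return new RegExp(`^${escaped}$`);
}

// True when the base name of `filePath` matches any configured pattern.
function isEvalFile(filePath: string, filenamePatterns: string[]): boolean {
  const baseName = filePath.split("/").pop() ?? "";
  return filenamePatterns.some((p) => globToRegExp(p).test(baseName));
}

const evalFilenames = ["eval.yaml", "*.eval.yaml"];
console.log(isEvalFile("evals/eval.yaml", evalFilenames));             // true
console.log(isEvalFile("tests/evals/login.eval.yaml", evalFilenames)); // true
console.log(isEvalFile("evals/notes.md", evalFilenames));              // false
```

Patterns are matched against the base name only, so the directory list and the filename patterns stay independent of each other in this sketch.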

A Stimulus is a prompt — what you ask the agent to do — plus configuration for how to grade the result.

eval.yaml
stimuli:
  - name: basic-test-generation
    prompt: |
      Write unit tests for this function:
      function add(a, b) { return a + b; }
    graders:
      - type: file-exists
        config:
          path: "add.test.js"
      - type: output-contains
        config:
          substring: "test"
    constraints:
      max_turns: 10
      max_tokens: 5000

Think of stimuli as your test cases. Each one defines:

  • prompt — what to ask the agent
  • graders — how to check the result (your assertions)
  • constraints (optional) — resource limits

Stimuli can also define an environment: skills to load, files to stage, and commands to run before execution.
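For readers who think in types, the stimulus fields above could be modelled along these lines. The shapes are inferred from the YAML example and are assumptions; Vally's real types may differ, and the environment block is left out because its fields are not described here. `validateStimulus` is a hypothetical helper.

```typescript
// Assumed shapes, inferred from the YAML example; the real Vally types may differ.
interface GraderSpec {
  type: string;
  config?: Record<string, unknown>;
}

interface StimulusSpec {
  name: string;
  prompt: string;
  graders: GraderSpec[];
  constraints?: { max_turns?: number; max_tokens?: number };
}

// Hypothetical helper: returns a list of problems, empty when the spec looks usable.
function validateStimulus(spec: StimulusSpec): string[] {
  const problems: string[] = [];
  if (spec.name.trim() === "") problems.push("name is empty");
  if (spec.prompt.trim() === "") problems.push("prompt is empty");
  if (spec.graders.length === 0) problems.push("no graders: nothing would be checked");
  return problems;
}

const sampleStimulus: StimulusSpec = {
  name: "basic-test-generation",
  prompt: "Write unit tests for this function:\nfunction add(a, b) { return a + b; }",
  graders: [
    { type: "file-exists", config: { path: "add.test.js" } },
    { type: "output-contains", config: { substring: "test" } },
  ],
  constraints: { max_turns: 10, max_tokens: 5000 },
};

console.log(validateStimulus(sampleStimulus)); // []
```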

An Executor takes a stimulus and runs an agent, producing a trajectory. The executor is responsible for:

  • Setting up the workspace
  • Loading skills into the agent
  • Sending the prompt
  • Capturing all events (tool calls, messages, token usage)
  • Respecting timeout and constraint limits

The built-in executor uses the Copilot SDK:

interface Executor {
  name: string;
  execute(stimulus: Stimulus, options: ExecutorOptions): Promise<Trajectory>;
  shutdown(): Promise<void>;
}

You can implement custom executors for other agent runtimes. The executor is a plugin point — swap it without changing your eval specs or graders.
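As a rough sketch of what a custom executor might look like, the example below echoes the prompt instead of driving an agent runtime. `Stimulus`, `ExecutorOptions`, and `Trajectory` are cut-down stand-ins so the snippet compiles on its own; `EchoExecutor` and the `timeoutMs` option are hypothetical names, not part of Vally.

```typescript
// Simplified stand-ins for Vally's types so the sketch is self-contained.
// The real Stimulus, ExecutorOptions, and Trajectory carry more fields.
interface Stimulus { name: string; prompt: string; }
interface ExecutorOptions { timeoutMs?: number; } // hypothetical option
interface Trajectory { id: string; stimulus: Stimulus; output: string; }

interface Executor {
  name: string;
  execute(stimulus: Stimulus, options: ExecutorOptions): Promise<Trajectory>;
  shutdown(): Promise<void>;
}

// Hypothetical executor that echoes the prompt instead of calling an agent runtime.
class EchoExecutor implements Executor {
  name = "echo";
  private runCount = 0;

  async execute(stimulus: Stimulus, _options: ExecutorOptions): Promise<Trajectory> {
    this.runCount += 1;
    return {
      id: `${this.name}-${this.runCount}`,
      stimulus,
      output: `echo: ${stimulus.prompt}`,
    };
  }

  async shutdown(): Promise<void> {
    // Nothing to release here; a real executor would close sessions or processes.
  }
}
```

A stub like this can be handy for exercising graders and scoring without spending tokens, since the rest of the pipeline only sees the returned trajectory.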

A Trajectory is the behavioral record of a single run — a flat array of typed events:

interface Trajectory {
  id: string;
  stimulus: Stimulus;
  events: TrajectoryEvent[]; // flat event log
  metrics: TrajectoryMetrics; // computed aggregates
  output: string; // final agent output
  workDir: string;
  metadata: TrajectoryMetadata;
}

Event types capture everything the agent did:

Event                  What it records
tool_call              Agent invoked a tool (name, arguments)
tool_result            Tool returned (success/failure, result)
token_usage            LLM token counts (input, output, cache, model)
turn_start / turn_end  Conversation turn boundaries
assistant_message      Agent’s text response
user_message           User/system message sent to agent
skill_activation       A skill was loaded and activated
error                  Something went wrong

Metrics are computed from events:

Metric                 Derivation
tokenUsage             Sum of all token_usage events, broken down by model
toolCallCount          Count of tool_call events
skillActivationCount   Count of skill_activation events
turnCount              Count of turn_end events
wallTimeMs             Time from first to last event
errorCount             Count of error events
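The derivations in the table can be sketched as a single pass over the event log. The event shape below is an assumption made for illustration: the `timestamp`, `model`, `input`, and `output` fields are hypothetical, only a subset of event types is modelled, and events are assumed to be in chronological order.

```typescript
// Simplified event shape for illustration; field names beyond `type` are assumptions.
type TrajectoryEvent =
  | { type: "tool_call"; timestamp: number; name: string }
  | { type: "turn_end"; timestamp: number }
  | { type: "error"; timestamp: number; message: string }
  | { type: "token_usage"; timestamp: number; model: string; input: number; output: number };

interface MetricsSketch {
  toolCallCount: number;
  turnCount: number;
  errorCount: number;
  wallTimeMs: number;
  tokensByModel: Record<string, number>;
}

// Fold the flat event log into aggregate counters.
function computeMetrics(events: TrajectoryEvent[]): MetricsSketch {
  const metrics: MetricsSketch = {
    toolCallCount: 0, turnCount: 0, errorCount: 0, wallTimeMs: 0, tokensByModel: {},
  };
  for (const ev of events) {
    switch (ev.type) {
      case "tool_call": metrics.toolCallCount += 1; break;
      case "turn_end": metrics.turnCount += 1; break;
      case "error": metrics.errorCount += 1; break;
      case "token_usage":
        metrics.tokensByModel[ev.model] =
          (metrics.tokensByModel[ev.model] ?? 0) + ev.input + ev.output;
        break;
    }
  }
  // Wall time spans the first to the last event (assumes chronological order).
  const first = events[0];
  const last = events[events.length - 1];
  if (first && last) metrics.wallTimeMs = last.timestamp - first.timestamp;
  return metrics;
}

const sampleEvents: TrajectoryEvent[] = [
  { type: "tool_call", timestamp: 1000, name: "write_file" },
  { type: "token_usage", timestamp: 1050, model: "example-model", input: 100, output: 20 },
  { type: "turn_end", timestamp: 1120 },
];
const sampleMetrics = computeMetrics(sampleEvents);
console.log(sampleMetrics.toolCallCount, sampleMetrics.wallTimeMs); // 1 120
```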

A Grader evaluates some aspect of an eval and produces a result. Most graders inspect a captured trajectory; static graders can also inspect project assets directly without running an agent. Every grader implements one interface:

interface Grader {
  metadata: GraderMetadata;
  grade(input: GraderInput): Promise<GraderResult>;
}

The GraderResult is always the same shape:

interface GraderResult {
  name: string;
  kind: "code" | "llm" | "human";
  passed: boolean;
  score: number; // normalized [0, 1]
  label?: string; // categorical: "correct", "incorrect", etc.
  evidence: string; // human-readable explanation
  details?: GraderResult[]; // sub-checks (for composite graders)
}

Graders range from simple (check if a file exists) to complex (ask an LLM to judge output quality). The built-in prompt grader sends the full trajectory to an LLM judge and evaluates against a rubric — it can assess things like code quality, explanation clarity, and task completeness that no static check can express.

What makes graders powerful is the taxonomy metadata — see Grader Taxonomy.

Example: a simple grader

Here’s the complete output-contains grader — one of the simplest built-in graders:

output-contains-grader.ts
export class OutputContainsGrader implements Grader {
  metadata: GraderMetadata = {
    name: "output-contains",
    description: "Checks if output contains specific text",
    costProfile: "low",
    determinism: "static",
    portability: "t1-universal",
    reference: "reference-free",
    temporalScope: "trajectory-level",
  };

  async grade(input: GraderInput): Promise<GraderResult> {
    const caseSensitive = input.config.case_sensitive ?? false;
    const output = caseSensitive ? input.trajectory.output : input.trajectory.output.toLowerCase();
    const substring = caseSensitive ? input.config.substring : input.config.substring.toLowerCase();
    const passed = output.includes(substring);
    return {
      name: this.metadata.name,
      kind: "code",
      passed,
      score: passed ? 1 : 0,
      evidence: passed
        ? `'${input.config.substring}' found in output`
        : `'${input.config.substring}' NOT found in output`,
      label: passed ? "correct" : "incorrect",
    };
  }
}
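Graders are not limited to the final output; they can also inspect the event log. The sketch below checks a tool-call budget. It is not a built-in grader: `MaxToolCallsGrader` and the `max_calls` config key are hypothetical, the types are reduced stand-ins, and the taxonomy metadata is omitted to keep the example short.

```typescript
// Simplified stand-ins for Vally's types; only the fields this sketch needs.
interface BudgetGraderInput {
  config: { max_calls: number }; // hypothetical config key
  trajectory: { events: { type: string }[] };
}

interface BudgetGraderResult {
  name: string;
  kind: "code" | "llm" | "human";
  passed: boolean;
  score: number;
  evidence: string;
}

// Hypothetical grader: passes when the agent stayed within a tool-call budget.
class MaxToolCallsGrader {
  readonly graderName = "max-tool-calls";

  async grade(input: BudgetGraderInput): Promise<BudgetGraderResult> {
    const calls = input.trajectory.events.filter((e) => e.type === "tool_call").length;
    const passed = calls <= input.config.max_calls;
    return {
      name: this.graderName,
      kind: "code",
      passed,
      score: passed ? 1 : 0,
      evidence: `${calls} tool call(s), budget ${input.config.max_calls}`,
    };
  }
}
```

Because the result has the same shape as any other grader's, a check like this would feed into scoring in the same way.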

Grader results are aggregated into a score using configurable weights and a pass threshold:

scoring:
  weights:
    file-exists: 1.0
    output-contains: 0.5
  threshold: 0.7
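One plausible reading of this configuration is a weighted mean compared against the threshold. The exact formula lives on the Scoring page, so treat the sketch below as an assumption: `aggregateScore` is a hypothetical helper, and defaulting missing weights to 1 is a guess.

```typescript
interface GradedResult { name: string; score: number; }

// Assumption: the aggregate is a weighted mean of grader scores.
// Graders without a configured weight default to 1 in this sketch.
function aggregateScore(
  results: GradedResult[],
  weights: Record<string, number>,
  threshold: number,
): { score: number; passed: boolean } {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const r of results) {
    const w = weights[r.name] ?? 1;
    weightedSum += w * r.score;
    totalWeight += w;
  }
  const score = totalWeight > 0 ? weightedSum / totalWeight : 0;
  return { score, passed: score >= threshold };
}

const outcome = aggregateScore(
  [{ name: "file-exists", score: 1 }, { name: "output-contains", score: 0 }],
  { "file-exists": 1.0, "output-contains": 0.5 },
  0.7,
);
console.log(outcome.score.toFixed(3), outcome.passed); // 0.667 false
```

Under this reading, failing the lower-weighted output-contains check alone drops the score to 1.0 / 1.5, which is just under the 0.7 threshold.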

For multiple trials (runs > 1), Vally computes multi-trial metrics; pass@k uses the unbiased estimator from Chen et al., 2021:

  • pass@k — probability of at least 1 success in k trials (unbiased combinatorial estimator)
  • pass^k — probability that ALL k trials succeed: p^k
  • flakiness — flags stimuli where some trials pass and others fail

Use --runs K on the CLI or set config.runs in your eval spec to enable multi-trial mode.

See Scoring for the full math.
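As a worked sketch of the multi-trial metrics, with n trials of which c passed, the unbiased pass@k estimator is 1 - C(n - c, k) / C(n, k). The code below evaluates it as a running product to avoid large binomial coefficients. The function names are hypothetical, and using the observed pass rate c / n as p in pass^k is an assumption rather than something stated on this page.

```typescript
// pass@k = 1 - C(n - c, k) / C(n, k), computed as a product for numerical stability.
// n: total trials, c: passing trials, k: sample size (k <= n).
function passAtK(n: number, c: number, k: number): number {
  if (k > n) throw new RangeError("k must not exceed n");
  if (n - c < k) return 1; // every size-k subset contains at least one success
  let allFail = 1;
  for (let i = n - c + 1; i <= n; i++) {
    allFail *= 1 - k / i;
  }
  return 1 - allFail;
}

// pass^k = p^k. Assumption: p is estimated by the observed pass rate c / n.
function passHatK(n: number, c: number, k: number): number {
  return Math.pow(c / n, k);
}

// A stimulus is flaky when some, but not all, trials passed.
function isFlaky(n: number, c: number): boolean {
  return c > 0 && c < n;
}

console.log(passAtK(5, 1, 2).toFixed(2));  // 0.40
console.log(passHatK(5, 1, 2).toFixed(2)); // 0.04
console.log(isFlaky(5, 1));                // true
```

With one pass in five trials, the chance that a sample of two contains at least one success is 0.40, while the chance that both succeed is only 0.04, which is why the two metrics answer different questions.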

┌─────────────────────────────────────────────────────────────────┐
│ eval.yaml                                                       │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │ stimulus: "Write tests for add(a,b)"                     │   │
│  │ graders: [file-exists, output-contains]                  │   │
│  │ config: { runs: 3, model: gpt-5.5 }                      │   │
│  └────────┬─────────────────────────────────────────────────┘   │
│           │                                                     │
│           ▼                                                     │
│  ┌─────────────────┐    ┌────────────────────────────────────┐  │
│  │ Executor        │───▶│ Trajectory (events + metrics)      │  │
│  │ (Copilot SDK)   │    │ tool_call, token_usage, output     │  │
│  └─────────────────┘    └──────────┬─────────────────────────┘  │
│                                    │                            │
│                  ┌─────────────────┼────────────────┐           │
│                  ▼                 ▼                ▼           │
│           ┌─────────────┐ ┌─────────────────┐  ┌─────────┐      │
│           │ file-exists │ │ output-contains │  │ (more)  │      │
│           └──────┬──────┘ └────────┬────────┘  └────┬────┘      │
│                  │                 │                │           │
│                  ▼                 ▼                ▼           │
│              ┌──────────────────────────────────────────┐       │
│              │ Score: 0.85 (pass@3: 0.97) ✔ PASSED      │       │
│              └──────────────────────────────────────────┘       │
└─────────────────────────────────────────────────────────────────┘