How It Works
Vally follows one pipeline model from local development to production evaluation:
Stimulus → Executor → Trajectory → Graders → Score
Each component has a clear responsibility. This page explains each one in turn.
The pipeline
Eval Discovery
Before running evals, the framework discovers eval files by scanning configured directories. By default, it finds files named eval.yaml or eval.yml in the evals/ directory.
You can customize this via .vally.yaml:
```yaml
paths:
  evals: [evals/, tests/evals/]                # scan multiple directories
  evalFilenames: ["eval.yaml", "*.eval.yaml"]  # custom filename patterns
```
Suites can further scope which eval files are included using glob patterns in the evals field. See Authoring Eval Suites for details.
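To make the filename matching concrete, here is a minimal sketch of how a discovered filename could be tested against the configured patterns. The helper names `patternToRegExp` and `isEvalFile` are hypothetical, and the sketch assumes `*` is the only wildcard; Vally's actual glob semantics may differ.

```typescript
// Hypothetical helper, not part of Vally: turns a filename pattern into a RegExp.
// Assumption: "*" is the only wildcard and never matches a path separator.
function patternToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&")) // escape regex metacharacters
    .join("[^/]*");
  return new RegExp(`^${escaped}$`);
}

// Returns true when the filename matches at least one pattern.
// The default patterns mirror the defaults described above.
function isEvalFile(
  filename: string,
  patterns: string[] = ["eval.yaml", "eval.yml"],
): boolean {
  return patterns.some((pattern) => patternToRegExp(pattern).test(filename));
}

// Usage: the custom patterns from the .vally.yaml example above.
const customPatterns = ["eval.yaml", "*.eval.yaml"];
console.log(isEvalFile("login.eval.yaml", customPatterns));
```

With the default patterns, a file such as login.eval.yaml is skipped; it is only picked up once a pattern like "*.eval.yaml" is configured.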
1. Stimulus
A Stimulus is a prompt (what you ask the agent to do) plus configuration for how to grade the result.
```yaml
stimuli:
  - name: basic-test-generation
    prompt: |
      Write unit tests for this function:
      function add(a, b) { return a + b; }
    graders:
      - type: file-exists
        config:
          path: "add.test.js"
      - type: output-contains
        config:
          substring: "test"
    constraints:
      max_turns: 10
      max_tokens: 5000
```
Think of stimuli as your test cases. Each one defines:
- prompt: what to ask the agent
- graders: how to check the result (your assertions)
- constraints (optional): resource limits
Stimuli can also define an environment: skills to load, files to stage, and commands to run before execution.
2. Executor
An Executor takes a stimulus and runs an agent, producing a trajectory. The executor is responsible for:
- Setting up the workspace
- Loading skills into the agent
- Sending the prompt
- Capturing all events (tool calls, messages, token usage)
- Respecting timeout and constraint limits
The built-in executor uses the Copilot SDK. Every executor implements the same interface:
```typescript
interface Executor {
  name: string;
  execute(stimulus: Stimulus, options: ExecutorOptions): Promise<Trajectory>;
  shutdown(): Promise<void>;
}
```
You can implement custom executors for other agent runtimes. The executor is a plugin point: swap it without changing your eval specs or graders.
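As an illustration of the plugin point, the sketch below implements the interface with a trivial executor that echoes the prompt back instead of running an agent. The type definitions are simplified stand-ins declared locally for this example (the real Vally types carry more fields), and `EchoExecutor`, `buildEchoTrajectory`, and the `workDir` option are hypothetical names.

```typescript
// Simplified stand-ins for the real types; assumptions made for this sketch only.
interface Stimulus { name: string; prompt: string; }
interface ExecutorOptions { workDir?: string; }
interface TrajectoryEvent { type: string; timestamp: number; content?: string; }
interface Trajectory {
  id: string;
  stimulus: Stimulus;
  events: TrajectoryEvent[];
  output: string;
  workDir: string;
}
interface Executor {
  name: string;
  execute(stimulus: Stimulus, options: ExecutorOptions): Promise<Trajectory>;
  shutdown(): Promise<void>;
}

// Builds the trajectory synchronously so the logic is easy to inspect and test.
function buildEchoTrajectory(stimulus: Stimulus, workDir: string): Trajectory {
  const output = `echo: ${stimulus.prompt}`;
  const now = Date.now();
  return {
    id: `echo-${stimulus.name}`,
    stimulus,
    events: [
      { type: "user_message", timestamp: now, content: stimulus.prompt },
      { type: "assistant_message", timestamp: now, content: output },
    ],
    output,
    workDir,
  };
}

// A hypothetical executor that never calls a model; handy for testing graders.
class EchoExecutor implements Executor {
  name = "echo";

  async execute(stimulus: Stimulus, options: ExecutorOptions): Promise<Trajectory> {
    return buildEchoTrajectory(stimulus, options.workDir ?? ".");
  }

  async shutdown(): Promise<void> {
    // Nothing to release: this executor holds no processes or connections.
  }
}
```

A stub like this can be useful for exercising graders and scoring without spending tokens, since the graders only see the trajectory it returns.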
3. Trajectory
A Trajectory is the behavioral record of a single run, centered on a flat array of typed events:
```typescript
interface Trajectory {
  id: string;
  stimulus: Stimulus;
  events: TrajectoryEvent[];    // flat event log
  metrics: TrajectoryMetrics;   // computed aggregates
  output: string;               // final agent output
  workDir: string;
  metadata: TrajectoryMetadata;
}
```
Event types capture everything the agent did:
| Event | What it records |
|---|---|
| `tool_call` | Agent invoked a tool (name, arguments) |
| `tool_result` | Tool returned (success/failure, result) |
| `token_usage` | LLM token counts (input, output, cache, model) |
| `turn_start` / `turn_end` | Conversation turn boundaries |
| `assistant_message` | Agent’s text response |
| `user_message` | User/system message sent to agent |
| `skill_activation` | A skill was loaded and activated |
| `error` | Something went wrong |
Metrics are computed from events:
| Metric | Derivation |
|---|---|
| `tokenUsage` | Sum of all `token_usage` events, broken down by model |
| `toolCallCount` | Count of `tool_call` events |
| `skillActivationCount` | Count of `skill_activation` events |
| `turnCount` | Count of `turn_end` events |
| `wallTimeMs` | Time from first to last event |
| `errorCount` | Count of `error` events |
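The derivations in the table can be sketched as a single pass over the event log. This is an illustration under assumptions, not Vally's implementation: the event field names (`timestamp`, `model`, `inputTokens`, `outputTokens`) and the `computeMetrics` function are hypothetical, and cache token counts are left out for brevity.

```typescript
// Assumed event shape; the real TrajectoryEvent type may use different field names.
interface TrajectoryEvent {
  type: string;
  timestamp: number;      // milliseconds since epoch (assumed)
  model?: string;         // present on token_usage events (assumed)
  inputTokens?: number;
  outputTokens?: number;
}

interface TokenTotals { input: number; output: number; }

interface TrajectoryMetrics {
  tokenUsage: Record<string, TokenTotals>; // keyed by model name
  toolCallCount: number;
  skillActivationCount: number;
  turnCount: number;
  wallTimeMs: number;
  errorCount: number;
}

function computeMetrics(events: TrajectoryEvent[]): TrajectoryMetrics {
  const count = (type: string): number =>
    events.filter((e) => e.type === type).length;

  // Sum token_usage events, broken down by model.
  const tokenUsage: Record<string, TokenTotals> = {};
  for (const e of events) {
    if (e.type !== "token_usage") continue;
    const model = e.model ?? "unknown";
    const totals = tokenUsage[model] ?? { input: 0, output: 0 };
    totals.input += e.inputTokens ?? 0;
    totals.output += e.outputTokens ?? 0;
    tokenUsage[model] = totals;
  }

  // Wall time is the span between the first and last event (0 for an empty log).
  const first = events[0];
  const last = events[events.length - 1];
  const wallTimeMs = first && last ? last.timestamp - first.timestamp : 0;

  return {
    tokenUsage,
    toolCallCount: count("tool_call"),
    skillActivationCount: count("skill_activation"),
    turnCount: count("turn_end"),
    wallTimeMs,
    errorCount: count("error"),
  };
}
```

Because every metric is derived from the flat event array, a grader can recompute or cross-check any of them from `events` alone.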
4. Graders
A Grader evaluates some aspect of an eval and produces a result. Most graders inspect a captured trajectory; static graders can also inspect project assets directly without running an agent. Every grader implements one interface:
```typescript
interface Grader {
  metadata: GraderMetadata;
  grade(input: GraderInput): Promise<GraderResult>;
}
```
The GraderResult is always the same shape:
```typescript
interface GraderResult {
  name: string;
  kind: "code" | "llm" | "human";
  passed: boolean;
  score: number;             // normalized [0, 1]
  label?: string;            // categorical: "correct", "incorrect", etc.
  evidence: string;          // human-readable explanation
  details?: GraderResult[];  // sub-checks (for composite graders)
}
```
Graders range from simple (check if a file exists) to complex (ask an LLM to judge output quality). The built-in prompt grader sends the full trajectory to an LLM judge and evaluates it against a rubric, so it can assess things like code quality, explanation clarity, and task completeness that no static check can express.
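The `details` field suggests how a composite grader could report its sub-checks. The sketch below folds several results into one parent result. The `combineResults` helper and its aggregation rule (pass only if every sub-check passes, score as the mean of sub-scores) are assumptions chosen for illustration, not Vally's documented behavior.

```typescript
// Same shape as the GraderResult interface shown above, repeated so the sketch is self-contained.
interface GraderResult {
  name: string;
  kind: "code" | "llm" | "human";
  passed: boolean;
  score: number;
  label?: string;
  evidence: string;
  details?: GraderResult[];
}

// Hypothetical helper: folds sub-check results into one composite result.
// Assumed rule: passed = all sub-checks passed; score = mean of sub-scores.
function combineResults(name: string, parts: GraderResult[]): GraderResult {
  if (parts.length === 0) {
    return { name, kind: "code", passed: false, score: 0, evidence: "no sub-checks provided", details: [] };
  }
  const passed = parts.every((p) => p.passed);
  const score = parts.reduce((sum, p) => sum + p.score, 0) / parts.length;
  const failed = parts.filter((p) => !p.passed).map((p) => p.name);
  return {
    name,
    kind: "code",
    passed,
    score,
    label: passed ? "correct" : "incorrect",
    evidence: passed
      ? `all ${parts.length} sub-checks passed`
      : `failed sub-checks: ${failed.join(", ")}`,
    details: parts, // keep the sub-results so reports can drill down
  };
}
```

Keeping the sub-results in `details` means the parent's `evidence` can stay short while the full breakdown remains available.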
What makes graders powerful is the taxonomy metadata — see Grader Taxonomy.
Example: a simple grader
Here’s the complete output-contains grader — one of the simplest built-in graders:
```typescript
export class OutputContainsGrader implements Grader {
  metadata: GraderMetadata = {
    name: "output-contains",
    description: "Checks if output contains specific text",
    costProfile: "low",
    determinism: "static",
    portability: "t1-universal",
    reference: "reference-free",
    temporalScope: "trajectory-level",
  };

  async grade(input: GraderInput): Promise<GraderResult> {
    const caseSensitive = input.config.case_sensitive ?? false;
    const output = caseSensitive
      ? input.trajectory.output
      : input.trajectory.output.toLowerCase();
    const substring = caseSensitive
      ? input.config.substring
      : input.config.substring.toLowerCase();
    const passed = output.includes(substring);

    return {
      name: this.metadata.name,
      kind: "code",
      passed,
      score: passed ? 1 : 0,
      evidence: passed
        ? `'${input.config.substring}' found in output`
        : `'${input.config.substring}' NOT found in output`,
      label: passed ? "correct" : "incorrect",
    };
  }
}
```
5. Score
Grader results are aggregated into a score using configurable weights and a pass threshold:
```yaml
scoring:
  weights:
    file-exists: 1.0
    output-contains: 0.5
  threshold: 0.7
```
For multiple trials (runs > 1), Vally computes multi-trial metrics using the unbiased estimator from Chen et al., 2021:
- pass@k: probability of at least 1 success in k trials (unbiased combinatorial estimator)
- pass^k: probability that ALL k trials succeed, computed as p^k
- flakiness: flags stimuli where some trials pass and others fail
Use --runs K on the CLI or set config.runs in your eval spec to enable multi-trial mode.
See Scoring for the full math.
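As a rough sketch of the arithmetic involved, the code below shows one way to compute a weighted score and the multi-trial metrics. Two points are assumptions made for illustration: the weighted score is taken to be a weighted mean of grader scores (the aggregation formula is not spelled out on this page), and p in pass^k is estimated as the observed pass rate c/n. The pass@k function uses the standard unbiased estimator 1 − C(n−c, k) / C(n, k), where n is the number of trials and c the number that passed. All function names are hypothetical.

```typescript
interface GradedCheck { name: string; score: number; }

// Assumed aggregation: weighted mean of grader scores; unlisted graders get weight 1.
function weightedScore(results: GradedCheck[], weights: Record<string, number>): number {
  let total = 0;
  let weightSum = 0;
  for (const r of results) {
    const w = weights[r.name] ?? 1;
    total += w * r.score;
    weightSum += w;
  }
  return weightSum > 0 ? total / weightSum : 0;
}

// Unbiased pass@k: 1 - C(n - c, k) / C(n, k), evaluated as a running product
// to avoid computing large binomial coefficients.
function passAtK(n: number, c: number, k: number): number {
  if (k > n) throw new RangeError("k must not exceed the number of trials n");
  if (n - c < k) return 1; // fewer than k failures: every k-subset contains a pass
  let allFail = 1;
  for (let i = n - c + 1; i <= n; i++) {
    allFail *= 1 - k / i;
  }
  return 1 - allFail;
}

// pass^k under the assumption p = c / n.
function passHatK(n: number, c: number, k: number): number {
  return Math.pow(c / n, k);
}

// A stimulus is flaky when some trials pass and others fail.
function isFlaky(n: number, c: number): boolean {
  return c > 0 && c < n;
}

// Usage with the weights and threshold from the YAML above:
// file-exists passes (score 1) and output-contains fails (score 0).
const score = weightedScore(
  [{ name: "file-exists", score: 1 }, { name: "output-contains", score: 0 }],
  { "file-exists": 1.0, "output-contains": 0.5 },
);
console.log(score >= 0.7 ? "PASSED" : "FAILED");
```

In the usage example the weighted mean is 1.0 / 1.5 ≈ 0.67, which falls just below the 0.7 threshold, so the heavier-weighted grader passing is not enough on its own.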
The full picture
```text
┌─────────────────────────────────────────────────────────────────┐
│  eval.yaml                                                      │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  stimulus: "Write tests for add(a,b)"                    │   │
│  │  graders: [file-exists, output-contains]                 │   │
│  │  config: { runs: 3, model: gpt-5.5 }                     │   │
│  └────────┬─────────────────────────────────────────────────┘   │
│           │                                                     │
│           ▼                                                     │
│  ┌─────────────────┐    ┌────────────────────────────────────┐  │
│  │  Executor       │───▶│  Trajectory (events + metrics)     │  │
│  │  (Copilot SDK)  │    │  tool_call, token_usage, output    │  │
│  └─────────────────┘    └──────────┬─────────────────────────┘  │
│                                    │                            │
│                   ┌────────────────┼───────────────┐            │
│                   ▼                ▼               ▼            │
│             ┌───────────┐  ┌───────────────┐  ┌──────────┐      │
│             │file-exists│  │output-contains│  │  (more)  │      │
│             └─────┬─────┘  └───────┬───────┘  └────┬─────┘      │
│                   │                │               │            │
│                   ▼                ▼               ▼            │
│              ┌──────────────────────────────────────────┐       │
│              │  Score: 0.85 (pass@3: 0.97) ✔ PASSED     │       │
│              └──────────────────────────────────────────┘       │
└─────────────────────────────────────────────────────────────────┘
```
Next steps
- Grader taxonomy: how graders declare their properties
- Writing custom graders: build your own