Writing Custom Executors
By default, Vally runs evals locally using the built-in copilot-sdk executor. The framework also ships a built-in mock executor that returns a pass-through trajectory for testing graders without a live agent — use executor: mock in your eval config for grader-only testing. Custom executors let you run evals on different backends — remote sandboxes, alternative AI SDKs, or specialized testing harnesses.
The Executor interface
Every executor implements this interface:
    interface Executor {
      name: string;
      execute(stimulus: Stimulus, options: ExecutorOptions): Promise<Trajectory>;
      shutdown(): Promise<void>;

      // Optional: for remote executors that need to sync files back
      finalizeWorkspace?(context: FinalizeWorkspaceContext): Promise<void>;
    }

| Method | Purpose |
|---|---|
| name | Unique identifier used in config.executor in eval specs |
| execute() | Run a single stimulus and return a trajectory with events, metrics, and output |
| shutdown() | Clean up resources (connections, processes, sandboxes) |
| finalizeWorkspace() | (Optional) Sync remote workspace files back to the local workDir after execution |
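To make the division of responsibilities concrete, here is a rough sketch of how a caller might drive an executor. It is not Vally's actual pipeline: the types are simplified local stand-ins for the real ones, and runAll and RecordingExecutor are hypothetical names used only for illustration.

```typescript
// Simplified stand-ins for the real Vally types (assumptions for illustration).
interface Stimulus { prompt: string }
interface Trajectory { output: string; workDir: string }
interface ExecutorOptions { timeout: number; workDir: string }
interface FinalizeWorkspaceContext { trajectory: Trajectory; workDir: string }

interface Executor {
  name: string;
  execute(stimulus: Stimulus, options: ExecutorOptions): Promise<Trajectory>;
  shutdown(): Promise<void>;
  finalizeWorkspace?(context: FinalizeWorkspaceContext): Promise<void>;
}

// Hypothetical driver: runs each stimulus, calls the optional hook when it
// exists, and always shuts the executor down exactly once at the end.
async function runAll(
  executor: Executor,
  stimuli: Stimulus[],
  options: ExecutorOptions,
): Promise<Trajectory[]> {
  const results: Trajectory[] = [];
  try {
    for (const stimulus of stimuli) {
      const trajectory = await executor.execute(stimulus, options);
      // finalizeWorkspace is optional, so guard the call.
      await executor.finalizeWorkspace?.({ trajectory, workDir: options.workDir });
      results.push(trajectory);
    }
    return results;
  } finally {
    await executor.shutdown(); // runs even if execute() threw
  }
}

// Hypothetical executor that records the order in which it is called.
class RecordingExecutor implements Executor {
  name = "recording";
  calls: string[] = [];

  async execute(stimulus: Stimulus, options: ExecutorOptions): Promise<Trajectory> {
    this.calls.push("execute");
    return { output: `echo: ${stimulus.prompt}`, workDir: options.workDir };
  }

  async finalizeWorkspace(_context: FinalizeWorkspaceContext): Promise<void> {
    this.calls.push("finalizeWorkspace");
  }

  async shutdown(): Promise<void> {
    this.calls.push("shutdown");
  }
}
```

Running two stimuli through RecordingExecutor records execute and finalizeWorkspace once per stimulus, followed by a single shutdown.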
ExecutorOptions
The pipeline passes these options to every execute() call:

    interface ExecutorOptions {
      skills?: Skill[]; // Skills to load for this run
      timeout: number; // Timeout in milliseconds
      workDir: string; // Working directory for the agent
      model?: string; // Model override
      sessionID?: string; // Resume an existing session
      mcpServers?: Record<string, McpServerConfig>; // MCP servers to attach
    }

Implementing an executor
Here’s a minimal executor that wraps a hypothetical AI SDK:
    import type { Executor, ExecutorOptions, Stimulus, Trajectory } from "@microsoft/vally";
    import { computeMetrics } from "@microsoft/vally";

    // myAiSdk and convertToTrajectoryEvents are placeholders for your own
    // SDK client and event-mapping code; they are not imported from @microsoft/vally.
    export class MyExecutor implements Executor {
      name = "my-executor";

      async execute(stimulus: Stimulus, options: ExecutorOptions): Promise<Trajectory> {
        const startedAt = new Date();

        // Call your AI backend
        const response = await myAiSdk.complete({
          prompt: stimulus.prompt,
          model: options.model,
          timeout: options.timeout,
        });

        const completedAt = new Date();
        const events = convertToTrajectoryEvents(response);
        const metrics = computeMetrics(events);

        return {
          id: crypto.randomUUID(),
          stimulus,
          events,
          output: response.text,
          workDir: options.workDir,
          metadata: {
            startedAt,
            completedAt,
            model: options.model ?? "unknown",
            executor: this.name,
            skillsLoaded: [],
            sessionID: "",
          },
          metrics: {
            ...metrics,
            wallTimeMs: completedAt.getTime() - startedAt.getTime(),
          },
        };
      }

      async shutdown(): Promise<void> {
        // Clean up connections, pools, etc.
      }
    }

Shipping as a plugin package
Custom executors can be shipped in a separate npm package and loaded at runtime via --executor-plugin.
- Create the package

  Your package exports a registerExecutors function that receives the executor registry:

  src/index.ts

      import type { ExecutorRegistry } from "@microsoft/vally";
      import { MyExecutor } from "./my-executor.js";

      export function registerExecutors(registry: ExecutorRegistry): void {
        registry.register(new MyExecutor());
      }

  Set @microsoft/vally as a peer dependency:

  package.json

      {
        "name": "@myorg/vally-executor-custom",
        "main": "dist/index.js",
        "peerDependencies": {
          "@microsoft/vally": "^0.3.0"
        }
      }

- Use it from the CLI

      # npm package
      vally eval --executor-plugin @myorg/vally-executor-custom --eval-spec eval.yaml

      # local path (useful during development)
      vally eval --executor-plugin ./my-executor --eval-spec eval.yaml

      # multiple executor plugins: useful when a single invocation runs several
      # eval specs (e.g., via -e A.yaml -e B.yaml or a suite) that select
      # different executors via their `config.executor` field
      vally eval --executor-plugin @myorg/exec-a --executor-plugin @myorg/exec-b -e A.yaml -e B.yaml

- Reference it in eval.yaml

  Use the executor’s name property in the eval spec:

  eval.yaml

      config:
        executor: my-executor
        model: gpt-5.5
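How a name in config.executor might resolve to a registered instance can be pictured with a small registry. This is a sketch under assumptions, not Vally's ExecutorRegistry: only register() appears in the text above, while SimpleRegistry, resolve(), NamedExecutor, and the duplicate-name check are hypothetical.

```typescript
// Minimal stand-in for the executor shape (assumption for illustration).
interface NamedExecutor {
  name: string;
  shutdown(): Promise<void>;
}

// Hypothetical registry: maps executor names to instances.
class SimpleRegistry {
  private readonly executors = new Map<string, NamedExecutor>();

  register(executor: NamedExecutor): void {
    if (this.executors.has(executor.name)) {
      throw new Error(`Executor "${executor.name}" is already registered`);
    }
    this.executors.set(executor.name, executor);
  }

  // Look up the executor named by an eval spec's config.executor value.
  resolve(name: string): NamedExecutor {
    const executor = this.executors.get(name);
    if (!executor) {
      const known = Array.from(this.executors.keys()).join(", ") || "(none)";
      throw new Error(`Unknown executor "${name}". Registered: ${known}`);
    }
    return executor;
  }
}

// A plugin entry point in the style described above.
function registerExecutors(registry: SimpleRegistry): void {
  registry.register({ name: "my-executor", shutdown: async () => {} });
}
```

Under this model, a mismatch between the name property and the value in eval.yaml surfaces as a lookup error, which is why the two strings need to match exactly.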
Lifecycle and resource management
- execute() may be called many times concurrently (controlled by --workers)
- shutdown() is called once when the eval run completes, even if errors occurred
- If your executor manages a pool of resources (connections, sandboxes), initialize lazily in execute() and clean up in shutdown()
- Register a signal handler for SIGINT/SIGTERM if your resources are expensive to leak
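The lazy-initialization advice above can be sketched as follows. Because execute() may run concurrently, the sketch caches the initialization promise rather than the pool itself, so parallel callers share one setup. SandboxPool, createPool, and PooledExecutor are hypothetical names, and the execute() signature is trimmed to what the example needs.

```typescript
// Hypothetical resource pool; a real one would hold connections or sandboxes.
interface SandboxPool {
  run(prompt: string): Promise<string>;
  close(): Promise<void>;
}

let poolsCreated = 0; // instrumentation for the example only

async function createPool(): Promise<SandboxPool> {
  poolsCreated += 1;
  await Promise.resolve(); // yield once to simulate asynchronous setup
  let open = true;
  return {
    run: async (prompt: string) => {
      if (!open) throw new Error("pool is closed");
      return `ran: ${prompt}`;
    },
    close: async () => {
      open = false;
    },
  };
}

class PooledExecutor {
  name = "pooled";
  // Cache the promise, not the pool, so concurrent callers share one init.
  private poolPromise: Promise<SandboxPool> | undefined;

  private getPool(): Promise<SandboxPool> {
    if (!this.poolPromise) {
      this.poolPromise = createPool();
    }
    return this.poolPromise;
  }

  // Trimmed signature: the real execute() takes a Stimulus and ExecutorOptions.
  async execute(prompt: string): Promise<string> {
    const pool = await this.getPool();
    return pool.run(prompt);
  }

  async shutdown(): Promise<void> {
    const pending = this.poolPromise;
    this.poolPromise = undefined;
    if (pending) {
      const pool = await pending;
      await pool.close();
    }
  }
}
```

Three parallel execute() calls on a fresh PooledExecutor create a single pool, and calling shutdown() a second time is a no-op. The sketch does not retry a failed initialization; a production executor would probably clear the cached promise when setup rejects.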
Remote executors and workspace sync
Local executors write files directly to options.workDir. For remote executors (where the agent runs on a different machine), files created during execution don’t exist locally. Without handling this, file-based graders (file-exists, file-contains, run-command, etc.) will silently grade stale state.
The pipeline provides a workspace materialization contract with two options:
Option 1: Implement finalizeWorkspace()
If your executor can sync files back, implement this optional method:

    import type { Executor, ExecutorOptions, FinalizeWorkspaceContext, Stimulus, Trajectory } from "@microsoft/vally";

    export class RemoteSandboxExecutor implements Executor {
      name = "remote-sandbox";

      async execute(stimulus: Stimulus, options: ExecutorOptions): Promise<Trajectory> {
        // Run agent on remote sandbox...
        return trajectory;
      }

      async finalizeWorkspace(context: FinalizeWorkspaceContext): Promise<void> {
        // Download final workspace state from remote to context.workDir.
        // Use mirror semantics: local dir must match remote exactly,
        // including any files that were deleted during execution.
        // downloadRemoteWorkspace is a placeholder for your own transfer code.
        await downloadRemoteWorkspace(context.trajectory.metadata.sessionID, context.workDir);
      }

      async shutdown(): Promise<void> { /* ... */ }
    }

The pipeline calls finalizeWorkspace() after execute() but before grading, and sets trajectory.workspaceStatus to "materialized".
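Mirror semantics mean the local directory ends up as an exact copy of the remote one, so stale local files must be removed as well as new ones fetched. A minimal sketch of that planning step, assuming each side can be summarized as a map from relative path to content hash (planMirror, FileIndex, and MirrorPlan are hypothetical and not part of Vally):

```typescript
// Relative file path -> content hash. Hypothetical shape for illustration.
type FileIndex = Map<string, string>;

interface MirrorPlan {
  download: string[]; // new or changed on the remote
  remove: string[]; // present locally but gone from the remote
}

// Compute what must change locally so that local matches remote exactly.
function planMirror(local: FileIndex, remote: FileIndex): MirrorPlan {
  const download: string[] = [];
  const remove: string[] = [];

  remote.forEach((hash, path) => {
    if (local.get(path) !== hash) download.push(path);
  });
  local.forEach((_hash, path) => {
    if (!remote.has(path)) remove.push(path);
  });

  return { download: download.sort(), remove: remove.sort() };
}

const localIndex: FileIndex = new Map([
  ["README.md", "aaa"],
  ["src/old.ts", "bbb"], // deleted by the agent on the remote
  ["src/app.ts", "ccc"],
]);
const remoteIndex: FileIndex = new Map([
  ["README.md", "aaa"], // unchanged
  ["src/app.ts", "ddd"], // modified
  ["src/new.ts", "eee"], // created
]);

const plan = planMirror(localIndex, remoteIndex);
// plan.download -> ["src/app.ts", "src/new.ts"]
// plan.remove   -> ["src/old.ts"]
```

A copy-only sync would skip the remove list, leaving src/old.ts on disk, and a file-exists grader could then pass on a file the agent had deleted.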
Option 2: Mark workspace as remote
If your executor can’t sync files back, set workspaceStatus: "remote" in the returned trajectory:

    async execute(stimulus: Stimulus, options: ExecutorOptions): Promise<Trajectory> {
      // ...
      return {
        ...trajectory,
        workspaceStatus: "remote", // Tells graders workspace isn't local
      };
    }

The pipeline will error with a clear message if any file-based graders are configured, rather than producing silently wrong scores. Output-based graders (output-contains, tool-calls, etc.) will still work.
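The guard described above might look roughly like this. The grader names come from the examples in this section; FILE_BASED_GRADERS, assertGradersCompatible, and the error text are hypothetical and do not reflect Vally's actual implementation.

```typescript
// Grader types that read the local workDir (taken from the examples above).
const FILE_BASED_GRADERS = new Set(["file-exists", "file-contains", "run-command"]);

// Hypothetical guard: fail loudly instead of grading a stale local directory.
function assertGradersCompatible(
  workspaceStatus: string | undefined,
  graderTypes: string[],
): void {
  if (workspaceStatus !== "remote") return;

  const blocked = graderTypes.filter((type) => FILE_BASED_GRADERS.has(type));
  if (blocked.length > 0) {
    throw new Error(
      `Workspace is remote, so file-based graders cannot run: ${blocked.join(", ")}`,
    );
  }
}

// Output-based graders are unaffected by a remote workspace.
assertGradersCompatible("remote", ["output-contains", "tool-calls"]);
```

Failing before grading starts is the point of the design: a thrown error is visible in the run output, whereas a file-exists check against an empty local directory would simply report a misleading score.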
Next steps
- Eval spec reference (config.executor field)
- CLI reference: eval (--executor-plugin flag)
- Writing custom graders (the grader plugin system follows the same pattern)
- Trajectory format (the data structure your executor must return)