Writing Custom Executors

By default, Vally runs evals locally using the built-in copilot-sdk executor. It also ships a built-in mock executor that returns a pass-through trajectory, so you can test graders without a live agent: set executor: mock in your eval config. Custom executors let you run evals on other backends, such as remote sandboxes, alternative AI SDKs, or specialized testing harnesses.

The Executor interface

Every executor implements this interface:

```ts
interface Executor {
  name: string;

  // Optional — when true, the pipeline may prepare a workspace once and copy it per trial
  supportsPreparedWorkspace?: boolean;

  // Optional — when true, the executor supports multi-turn stimuli (Stimulus.turns)
  // Must be set to opt in; runEval() rejects multi-turn stimuli on executors without this flag
  supportsMultiTurn?: boolean;

  execute(stimulus: Stimulus, options: ExecutorOptions): Promise<Trajectory>;
  shutdown(): Promise<void>;
}
```

Member	Purpose
`name`	Unique identifier used in `defaults.executor` in eval specs
`execute()`	Run a single stimulus and return a trajectory with events, metrics, and output
`shutdown()`	Clean up resources (connections, processes, sandboxes)
`supportsPreparedWorkspace`	(Optional) When `true`, the pipeline prepares the workspace once and copies it per trial
`supportsMultiTurn`	(Optional) When `true`, the executor handles `Stimulus.turns` — each prompt is sent sequentially on the same session. Without this flag, multi-turn stimuli fail fast with a clear error.
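
The fail-fast behavior in the last row can be pictured roughly as follows. This is a sketch of the described check, not Vally's actual code: `StimulusLike`, `ExecutorLike`, and `assertMultiTurnSupported` are hypothetical stand-ins, and it assumes `turns` is an array.

```typescript
// Hypothetical stand-ins for the real Vally types; only the fields used here.
interface StimulusLike {
  prompt?: string;
  turns?: unknown[]; // assumption: multi-turn stimuli carry an array of turns
}

interface ExecutorLike {
  name: string;
  supportsMultiTurn?: boolean;
}

// Throws when a multi-turn stimulus reaches an executor that has not opted in.
function assertMultiTurnSupported(executor: ExecutorLike, stimulus: StimulusLike): void {
  const isMultiTurn = Array.isArray(stimulus.turns) && stimulus.turns.length > 0;
  if (isMultiTurn && executor.supportsMultiTurn !== true) {
    throw new Error(
      `Executor "${executor.name}" does not support multi-turn stimuli; ` +
        `set supportsMultiTurn to true to opt in.`,
    );
  }
}
```

An executor that leaves the flag unset still accepts single-prompt stimuli; only stimuli with turns are rejected.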

ExecutorOptions

The pipeline passes these options to every execute() call:

```ts
interface ExecutorOptions {
  skills?: Skill[]; // Skills to load for this run
  timeout: number; // Timeout in milliseconds
  workDir: string; // Working directory for the agent
  model?: string; // Model override
  sessionID?: string; // Resume an existing session
  mcpServers?: Record<string, McpServerConfig>; // MCP servers to attach
  onRawEvent?: (event: unknown) => void; // Callback for each raw SDK event during execution
  sessionLog?: ExecutorSessionLogOptions; // Per-run directory for native session files (e.g. events.jsonl)
}

interface ExecutorSessionLogOptions {
  rootDir: string; // Local root directory for executor-native session state for this run
  sessionID?: string; // Optional stable session ID for executors that support caller-provided IDs
}
```
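
As an illustration of how an executor might honor two of these options, the sketch below enforces `timeout` and forwards raw events to `onRawEvent`. The `OptionsLike` type, the `Backend` callback shape, and both helper functions are assumptions made for this example; they are not part of Vally.

```typescript
// Hypothetical stand-in for the subset of ExecutorOptions used in this sketch.
interface OptionsLike {
  timeout: number; // milliseconds
  onRawEvent?: (event: unknown) => void;
}

// Hypothetical backend: emits raw events through a callback and resolves with the final text.
type Backend = (emit: (event: unknown) => void) => Promise<string>;

// Rejects if `work` has not settled within `ms` milliseconds.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  try {
    return await Promise.race([work, timedOut]);
  } finally {
    // Clear the timer either way so it cannot keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Runs the backend, recording every raw event and passing it to the optional callback.
async function runWithOptions(
  backend: Backend,
  options: OptionsLike,
): Promise<{ output: string; rawEvents: unknown[] }> {
  const rawEvents: unknown[] = [];
  const work = backend((event) => {
    rawEvents.push(event);
    if (options.onRawEvent) options.onRawEvent(event);
  });
  const output = await withTimeout(work, options.timeout);
  return { output, rawEvents };
}
```

Note that `Promise.race` only stops waiting; it does not cancel the underlying call. A real executor would likely also abort the backend request when the timeout fires.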

Implementing an executor

Here’s a minimal executor that wraps a hypothetical AI SDK:

```ts
import { randomUUID } from "node:crypto";
import type { Executor, ExecutorOptions, Stimulus, Trajectory } from "@microsoft/vally";
import { computeMetrics } from "@microsoft/vally";

// myAiSdk and convertToTrajectoryEvents are placeholders for your own backend
// client and event mapping; they are not part of Vally.

export class MyExecutor implements Executor {
  name = "my-executor";

  async execute(stimulus: Stimulus, options: ExecutorOptions): Promise<Trajectory> {
    const startedAt = new Date();

    // Call your AI backend
    const response = await myAiSdk.complete({
      prompt: stimulus.prompt,
      model: options.model,
      timeout: options.timeout,
    });

    const completedAt = new Date();
    // Map the backend response to trajectory events, then derive metrics from them
    const events = convertToTrajectoryEvents(response);
    const metrics = computeMetrics(events);

    return {
      id: randomUUID(),
      stimulus,
      events,
      output: response.text,
      workDir: options.workDir,
      metadata: {
        startedAt,
        completedAt,
        model: options.model ?? "unknown",
        executor: this.name,
        skillsLoaded: [],
        sessionID: "",
      },
      metrics: {
        ...metrics,
        // Wall-clock duration of the backend call
        wallTimeMs: completedAt.getTime() - startedAt.getTime(),
      },
    };
  }

  async shutdown(): Promise<void> {
    // Clean up connections, pools, etc.
  }
}
```

Shipping as a plugin package

Custom executors can be shipped in a separate npm package and loaded at runtime via --executor-plugin.

Create the package

Your package exports a registerExecutors function that receives the executor registry:

```ts
import type { ExecutorRegistry } from "@microsoft/vally";
import { MyExecutor } from "./my-executor.js";

export function registerExecutors(registry: ExecutorRegistry): void {
  registry.register(new MyExecutor());
}
```
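
One way to unit-test a plugin entry point without loading Vally is to pass it a small in-memory registry. The sketch below is only an illustration: `ExecutorLike`, `RegistryLike`, and `FakeRegistry` are hypothetical stand-ins, only the `register` method is taken from the example above, and rejecting duplicate names is an assumption of this fake rather than documented behavior of the real registry.

```typescript
// Hypothetical stand-ins; the real Executor and ExecutorRegistry types may differ.
interface ExecutorLike {
  name: string;
  shutdown(): Promise<void>;
}

interface RegistryLike {
  register(executor: ExecutorLike): void;
}

// In-memory registry for tests. Duplicate names are rejected here because
// `name` is meant to be a unique identifier (an assumption of this fake).
class FakeRegistry implements RegistryLike {
  private readonly executors = new Map<string, ExecutorLike>();

  register(executor: ExecutorLike): void {
    if (this.executors.has(executor.name)) {
      throw new Error(`Executor "${executor.name}" is already registered`);
    }
    this.executors.set(executor.name, executor);
  }

  names(): string[] {
    return Array.from(this.executors.keys());
  }
}

// Same shape as the plugin entry point, written against the stand-in types.
function registerExecutors(registry: RegistryLike): void {
  registry.register({ name: "my-executor", shutdown: async () => {} });
}

const registry = new FakeRegistry();
registerExecutors(registry);
console.log(registry.names()); // the list should contain "my-executor"
```

A check like this can catch a mismatch between the registered name and the `defaults.executor` value before the plugin is published.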

Set @microsoft/vally as a peer dependency:

```json
{
  "name": "@myorg/vally-executor-custom",
  "main": "dist/index.js",
  "peerDependencies": {
    "@microsoft/vally": "^0.3.0"
  }
}
```

Use it from the CLI

```sh
# npm package
vally eval --executor-plugin @myorg/vally-executor-custom --eval-spec eval.yaml

# local path (useful during development)
vally eval --executor-plugin ./my-executor --eval-spec eval.yaml

# multiple executor plugins — useful when a single invocation runs several
# eval specs (e.g., via -e A.yaml -e B.yaml or a suite) that select
# different executors via their `defaults.executor` field
vally eval --executor-plugin @myorg/exec-a --executor-plugin @myorg/exec-b -e A.yaml -e B.yaml
```

Reference it in eval.yaml

Use the executor’s name property in the eval spec:
eval.yaml
```
defaults:
  executor: my-executor
  model: gpt-5.5
```

Lifecycle and resource management

- execute() may be called many times concurrently (controlled by --workers).
- shutdown() is called once when the eval run completes, even if errors occurred.
- If your executor manages a pool of resources (connections, sandboxes), initialize lazily in execute() and clean up in shutdown().
- Register a signal handler for SIGINT/SIGTERM if your resources are expensive to leak.
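
The lazy-initialization advice above can be sketched as follows. The `Pool` interface, the `createPool` factory, and the `LazyPoolExecutor` class are hypothetical and use simplified signatures rather than the real Executor interface. The point of the sketch is that memoizing the initialization promise, not the pool itself, lets concurrent execute() calls share a single setup.

```typescript
// Hypothetical resource type; a real executor might hold sandboxes or connections.
interface Pool {
  close(): Promise<void>;
}

class LazyPoolExecutor {
  private readonly createPool: () => Promise<Pool>;
  private poolPromise: Promise<Pool> | undefined;

  constructor(createPool: () => Promise<Pool>) {
    this.createPool = createPool;
  }

  // Memoize the promise so concurrent callers wait on the same initialization.
  private getPool(): Promise<Pool> {
    if (this.poolPromise === undefined) {
      this.poolPromise = this.createPool().catch((error: unknown) => {
        this.poolPromise = undefined; // allow a later call to retry
        throw error;
      });
    }
    return this.poolPromise;
  }

  async execute(prompt: string): Promise<string> {
    const pool = await this.getPool();
    void pool; // a real executor would borrow a resource from the pool here
    return `ran: ${prompt}`;
  }

  // Safe to call when nothing was initialized, and safe to call more than once.
  async shutdown(): Promise<void> {
    const pending = this.poolPromise;
    this.poolPromise = undefined;
    if (pending === undefined) return;
    const pool = await pending.catch(() => undefined);
    if (pool !== undefined) await pool.close();
  }
}
```

For the signal case under Node.js, a handler along the lines of `process.once("SIGINT", () => { void executor.shutdown(); })` could reuse the same cleanup path. Keep in mind that installing a SIGINT handler replaces Node's default exit behavior, so the handler has to end the process itself once cleanup finishes.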

Workspace location

Executors must write the agent’s files to the options.workDir the pipeline provides; that local directory is what file-based graders (file-exists, file-contains, run-command, etc.) read.
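
As a rough illustration, an executor could route every file the agent produces through one helper that resolves paths against options.workDir and refuses anything that would land outside it. `writeAgentFile` and `demo` are hypothetical names for this sketch; Vally does not prescribe them, and the temporary directory only stands in for the directory the pipeline would supply.

```typescript
import { mkdir, mkdtemp, readFile, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { dirname, isAbsolute, join, relative, resolve, sep } from "node:path";

// Hypothetical helper: writes one agent-produced file under the pipeline's workDir.
async function writeAgentFile(
  workDir: string,
  relativePath: string,
  contents: string,
): Promise<string> {
  const root = resolve(workDir);
  const target = resolve(root, relativePath);

  // Refuse paths that escape workDir (for example "../outside.txt" or an absolute path).
  const rel = relative(root, target);
  if (rel === "" || rel === ".." || rel.startsWith(".." + sep) || isAbsolute(rel)) {
    throw new Error(`Refusing to write outside workDir: ${relativePath}`);
  }

  await mkdir(dirname(target), { recursive: true });
  await writeFile(target, contents, "utf8");
  return target;
}

// Usage: a throwaway temp directory stands in for options.workDir.
async function demo(): Promise<string> {
  const workDir = await mkdtemp(join(tmpdir(), "vally-workdir-"));
  const written = await writeAgentFile(workDir, "src/hello.txt", "hello\n");
  return readFile(written, "utf8");
}
```

Keeping all writes behind a helper like this means file-based graders always find the agent's output in the directory the pipeline handed to the executor.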

Running the agent on a different machine (a remote sandbox, container, or VM) is a Backend concern, not an executor one: implement a Backend that owns the remote workspace and transfers files back via the Backend egress SPI, rather than handling remoteness inside the executor.

Next steps

- Eval spec reference: the defaults.executor field
- CLI reference: eval, for the --executor-plugin flag
- Writing custom graders: the grader plugin system follows the same pattern
- Trajectory format: the data structure your executor must return