
Writing Custom Executors

By default, Vally runs evals locally using the built-in copilot-sdk executor. The framework also ships a built-in mock executor that returns a pass-through trajectory for testing graders without a live agent — use executor: mock in your eval config for grader-only testing. Custom executors let you run evals on different backends — remote sandboxes, alternative AI SDKs, or specialized testing harnesses.

Every executor implements this interface:

interface Executor {
  name: string;
  execute(stimulus: Stimulus, options: ExecutorOptions): Promise<Trajectory>;
  shutdown(): Promise<void>;
  // Optional: for remote executors that need to sync files back
  finalizeWorkspace?(context: FinalizeWorkspaceContext): Promise<void>;
}

| Method | Purpose |
| --- | --- |
| name | Unique identifier used in config.executor in eval specs |
| execute() | Run a single stimulus and return a trajectory with events, metrics, and output |
| shutdown() | Clean up resources (connections, processes, sandboxes) |
| finalizeWorkspace() | (Optional) Sync remote workspace files back to the local workDir after execution |

The pipeline passes these options to every execute() call:

interface ExecutorOptions {
  skills?: Skill[]; // Skills to load for this run
  timeout: number; // Timeout in milliseconds
  workDir: string; // Working directory for the agent
  model?: string; // Model override
  sessionID?: string; // Resume an existing session
  mcpServers?: Record<string, McpServerConfig>; // MCP servers to attach
}
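
Because timeout is given in milliseconds, an executor whose backend has no timeout parameter of its own may need to enforce the limit itself. The following is a minimal sketch of one way to do that with Promise.race. withTimeout is a hypothetical helper, not part of @microsoft/vally, and how the pipeline reacts to a rejected execute() call is not covered on this page.

```typescript
// Hypothetical helper (not part of @microsoft/vally): rejects if `work`
// does not settle within `timeoutMs` milliseconds.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const deadline = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      reject(new Error(`Timed out after ${timeoutMs} ms`));
    }, timeoutMs);
  });

  // Whichever promise settles first wins. The timer is always cleared so
  // it cannot keep the process alive after the work has finished.
  return Promise.race([work, deadline]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

Inside execute(), a backend call could then be wrapped as withTimeout(backendCall, options.timeout). Note that this only stops waiting; it does not cancel the underlying work, so a backend that supports cancellation should also be told to stop.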

Here’s a minimal executor that wraps a hypothetical AI SDK:

src/my-executor.ts
import type { Executor, ExecutorOptions, Stimulus, Trajectory } from "@microsoft/vally";
import { computeMetrics } from "@microsoft/vally";

// myAiSdk and convertToTrajectoryEvents are placeholders for your own
// backend client and response-to-event mapping; they are not part of Vally.
export class MyExecutor implements Executor {
  name = "my-executor";

  async execute(stimulus: Stimulus, options: ExecutorOptions): Promise<Trajectory> {
    const startedAt = new Date();

    // Call your AI backend
    const response = await myAiSdk.complete({
      prompt: stimulus.prompt,
      model: options.model,
      timeout: options.timeout,
    });
    const completedAt = new Date();

    // Convert the backend response into trajectory events and derive metrics
    const events = convertToTrajectoryEvents(response);
    const metrics = computeMetrics(events);

    return {
      id: crypto.randomUUID(),
      stimulus,
      events,
      output: response.text,
      workDir: options.workDir,
      metadata: {
        startedAt,
        completedAt,
        model: options.model ?? "unknown",
        executor: this.name,
        skillsLoaded: [],
        sessionID: "",
      },
      metrics: {
        ...metrics,
        wallTimeMs: completedAt.getTime() - startedAt.getTime(),
      },
    };
  }

  async shutdown(): Promise<void> {
    // Clean up connections, pools, etc.
  }
}

Custom executors can be shipped in a separate npm package and loaded at runtime via --executor-plugin.

  1. Create the package

    Your package exports a registerExecutors function that receives the executor registry:

    src/index.ts
    import type { ExecutorRegistry } from "@microsoft/vally";
    import { MyExecutor } from "./my-executor.js";

    export function registerExecutors(registry: ExecutorRegistry): void {
      registry.register(new MyExecutor());
    }

    Set @microsoft/vally as a peer dependency:

    package.json
    {
      "name": "@myorg/vally-executor-custom",
      "main": "dist/index.js",
      "peerDependencies": {
        "@microsoft/vally": "^0.3.0"
      }
    }

  2. Use it from the CLI

    # npm package
    vally eval --executor-plugin @myorg/vally-executor-custom --eval-spec eval.yaml
    # local path (useful during development)
    vally eval --executor-plugin ./my-executor --eval-spec eval.yaml
    # multiple executor plugins — useful when a single invocation runs several
    # eval specs (e.g., via -e A.yaml -e B.yaml or a suite) that select
    # different executors via their `config.executor` field
    vally eval --executor-plugin @myorg/exec-a --executor-plugin @myorg/exec-b -e A.yaml -e B.yaml
  3. Reference it in eval.yaml

    Use the executor’s name property in the eval spec:

    eval.yaml
    config:
      executor: my-executor
      model: gpt-5.5

Keep the executor lifecycle in mind when writing an executor:

  • execute() may be called many times concurrently (controlled by --workers)
  • shutdown() is called once when the eval run completes, even if errors occurred
  • If your executor manages a pool of resources (connections, sandboxes), initialize lazily in execute() and clean up in shutdown()
  • Register a signal handler for SIGINT/SIGTERM if your resources are expensive to leak
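
The lazy-initialization advice interacts with concurrency: if several execute() calls arrive before the pool exists, each one could start its own initialization. A minimal sketch of one way to avoid that is to cache the initialization promise rather than the pool itself. Everything here (SandboxPool, createPool, PooledBackend, poolStats) is a hypothetical stand-in for illustration and not part of @microsoft/vally.

```typescript
// Stand-in for an expensive resource; a real executor would hold SDK
// clients, connections, or sandbox handles here.
interface SandboxPool {
  run(prompt: string): Promise<string>;
  close(): Promise<void>;
}

// Counters that make the behavior observable in this sketch.
const poolStats = { opened: 0, closed: 0 };

// Hypothetical factory for the expensive resource.
async function createPool(): Promise<SandboxPool> {
  poolStats.opened += 1;
  return {
    run: async (prompt: string) => `echo: ${prompt}`,
    close: async () => {
      poolStats.closed += 1;
    },
  };
}

class PooledBackend {
  // Cache the promise, not the pool, so concurrent callers share one
  // in-flight initialization instead of each starting their own.
  // In this sketch a failed initialization stays cached.
  private poolPromise: Promise<SandboxPool> | undefined;

  private getPool(): Promise<SandboxPool> {
    if (this.poolPromise === undefined) {
      this.poolPromise = createPool();
    }
    return this.poolPromise;
  }

  // Intended to be called from execute(); may run many times concurrently.
  async run(prompt: string): Promise<string> {
    const pool = await this.getPool();
    return pool.run(prompt);
  }

  // Intended to be called from shutdown(); a no-op if no pool was ever
  // created, and safe to call more than once.
  async shutdown(): Promise<void> {
    const pending = this.poolPromise;
    this.poolPromise = undefined;
    if (pending !== undefined) {
      const pool = await pending;
      await pool.close();
    }
  }
}
```

An executor could hold one PooledBackend instance, delegate to run() from execute(), and delegate to shutdown() from its own shutdown(). Making shutdown() safe to repeat also helps if the same cleanup is triggered from a signal handler.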

Local executors write files directly to options.workDir. For remote executors (where the agent runs on a different machine), files created during execution don’t exist locally. Without handling this, file-based graders (file-exists, file-contains, run-command, etc.) will silently grade stale state.

The pipeline provides a workspace materialization contract with two options, depending on whether your executor can sync files back.

If your executor can sync files back, implement the optional finalizeWorkspace() method:

src/remote-executor.ts
import type {
  Executor,
  ExecutorOptions,
  FinalizeWorkspaceContext,
  Stimulus,
  Trajectory,
} from "@microsoft/vally";

// runOnRemoteSandbox and downloadRemoteWorkspace are placeholders for your own code.
export class RemoteSandboxExecutor implements Executor {
  name = "remote-sandbox";

  async execute(stimulus: Stimulus, options: ExecutorOptions): Promise<Trajectory> {
    // Run the agent on the remote sandbox and build the trajectory
    const trajectory = await runOnRemoteSandbox(stimulus, options);
    return trajectory;
  }

  async finalizeWorkspace(context: FinalizeWorkspaceContext): Promise<void> {
    // Download final workspace state from remote to context.workDir.
    // Use mirror semantics: the local dir must match the remote exactly,
    // including any files that were deleted during execution.
    await downloadRemoteWorkspace(context.trajectory.metadata.sessionID, context.workDir);
  }

  async shutdown(): Promise<void> {
    /* ... */
  }
}
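
Mirror semantics means the sync has two halves: fetching what exists remotely and deleting what exists only locally. As a rough illustration, assuming you can list workspace-relative file paths on both sides, the plan can be computed as below. planMirror and MirrorPlan are hypothetical names for this sketch, not part of @microsoft/vally.

```typescript
interface MirrorPlan {
  download: string[]; // remote paths to fetch, overwriting any local copy
  remove: string[]; // local paths that no longer exist remotely
}

// Hypothetical helper: both arguments are workspace-relative file paths.
function planMirror(remotePaths: string[], localPaths: string[]): MirrorPlan {
  const remote = new Set(remotePaths);
  const local = Array.from(new Set(localPaths));
  return {
    // Fetch every remote file: a local file with the same path may be stale.
    download: Array.from(remote).sort(),
    // Anything present only locally was deleted (or never created) remotely.
    remove: local.filter((path) => !remote.has(path)).sort(),
  };
}

const plan = planMirror(
  ["src/app.ts", "README.md"],
  ["README.md", "tmp/scratch.txt"],
);
// plan.download → ["README.md", "src/app.ts"]
// plan.remove   → ["tmp/scratch.txt"]
```

Skipping the remove half is the easy mistake: a grader such as file-exists would then still see a file the agent deleted on the remote side.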

The pipeline calls finalizeWorkspace() after execute() but before grading, and sets trajectory.workspaceStatus to "materialized".

If your executor can’t sync files back, set workspaceStatus: "remote" in the returned trajectory:

async execute(stimulus, options) {
  // ...
  return {
    ...trajectory,
    workspaceStatus: "remote", // Tells graders the workspace isn't local
  };
}

The pipeline will error with a clear message if any file-based graders are configured, rather than producing silently wrong scores. Output-based graders (output-contains, tool-calls, etc.) will still work.
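
If you want to catch this combination in your own tooling before starting a run, the rule can be approximated as follows. This is not the pipeline's implementation, only a sketch of the behavior described above. findBlockedGraders is a hypothetical helper, and the set contains only the file-based grader types named on this page, so the real list is longer.

```typescript
// File-based grader types named on this page; the full list is longer.
const FILE_BASED_GRADERS = new Set(["file-exists", "file-contains", "run-command"]);

// Hypothetical helper: returns the configured grader types that cannot
// work when the workspace stays on the remote machine.
function findBlockedGraders(
  workspaceStatus: string | undefined,
  graderTypes: string[],
): string[] {
  if (workspaceStatus !== "remote") return [];
  return graderTypes.filter((type) => FILE_BASED_GRADERS.has(type));
}

const blocked = findBlockedGraders("remote", ["output-contains", "file-exists", "tool-calls"]);
// blocked → ["file-exists"]
```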