
Writing Custom Graders

Vally ships with built-in graders, but real-world evals often need domain-specific checks. This guide walks through building a custom grader from scratch.

Every grader implements this interface:

interface Grader {
  metadata: GraderMetadata;
  grade(input: GraderInput): Promise<GraderResult>;
}

interface GraderMetadata {
  name: string;
  description: string;
  behavior: GraderBehavior;
  determinism: "static" | "complex-static" | "slm" | "llm";
  portability: "t1-universal" | "t2-domain" | "t3a-scenario" | "t3b-system";
  reference: "reference-free" | "reference-based";
  temporalScope: "point-in-time" | "trajectory-level" | "cross-trajectory";
  costProfile: "free" | "low" | "medium" | "high";
}

interface GraderBehavior {
  execution: "single" | "comparative";
  requiresLlmClient?: boolean;
}

The grade method receives a GraderInput containing:

  • trajectory — the full event log from the agent run
  • stimulus — the prompt and config that produced it
  • config — grader-specific config from the eval spec
  • trajectoryB (optional) — for pairwise A/B comparisons
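When `behavior.execution` is `"comparative"`, a grader can read `trajectoryB` alongside `trajectory`. The sketch below is a minimal illustration, not Vally's own implementation: the local `TrajectoryEvent`, `SketchTrajectory`, and `ComparativeInput` types are simplified stand-ins assumed from the fields listed above, and the `fewerErrors` helper and its return shape are hypothetical.

```typescript
// Simplified stand-ins for the real Vally types (assumed shapes, not the library's definitions).
interface TrajectoryEvent { type: string; data?: Record<string, unknown> }
interface SketchTrajectory { events: TrajectoryEvent[] }
interface ComparativeInput { trajectory?: SketchTrajectory; trajectoryB?: SketchTrajectory }
interface ComparisonOutcome { passed: boolean; score: number; evidence: string }

// Count the events whose type is "error".
const countErrors = (t: SketchTrajectory): number =>
  t.events.filter((e) => e.type === "error").length;

// Hypothetical helper: passes when run A has no more errors than run B.
function fewerErrors(input: ComparativeInput): ComparisonOutcome {
  if (!input.trajectory || !input.trajectoryB) {
    throw new Error("Comparative grading needs both trajectory and trajectoryB");
  }
  const a = countErrors(input.trajectory);
  const b = countErrors(input.trajectoryB);
  const passed = a <= b;
  return { passed, score: passed ? 1 : 0, evidence: `A: ${a} error(s), B: ${b} error(s)` };
}
```

A real comparative grader would wrap logic like this in `grade` and return a full `GraderResult`.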

Let’s build a grader that checks whether the agent produced any errors during its run.

  1. Define the grader class

    no-errors-grader.ts
    import type { Grader, GraderMetadata, GraderInput, GraderResult } from "@microsoft/vally";

    export class NoErrorsGrader implements Grader {
      metadata: GraderMetadata = {
        name: "no-errors",
        description: "Checks that the agent produced no error events",
        behavior: { execution: "single" },
        determinism: "static",
        costProfile: "free",
        portability: "t1-universal",
        reference: "reference-free",
        temporalScope: "trajectory-level",
      };

      async grade(input: GraderInput): Promise<GraderResult> {
        if (!input.trajectory) {
          throw new Error("Missing trajectory");
        }
        const errors = input.trajectory.events.filter((e) => e.type === "error");
        const passed = errors.length === 0;
        return {
          name: this.metadata.name,
          kind: "code",
          passed,
          score: passed ? 1 : 0,
          evidence: passed
            ? "No error events in trajectory"
            : `${errors.length} error(s): ${errors.map((e) => e.data.message).join(", ")}`,
          label: passed ? "correct" : "incorrect",
        };
      }
    }
  2. Register it

    register.ts
    import { createGraderRegistry } from "@microsoft/vally";
    import { NoErrorsGrader } from "./no-errors-grader.js";
    const registry = createGraderRegistry();
    registry.register(new NoErrorsGrader());
  3. Use it in eval.yaml

    eval.yaml
    stimuli:
    - name: test-case
    prompt: "Do something"
    graders:
    - type: no-errors
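The `type: no-errors` entry works because it matches the `metadata.name` of the registered grader. As a rough mental model only (an assumption about the lookup, not the code behind `createGraderRegistry()`), a name-keyed registry could look like the hypothetical `SketchRegistry` below.

```typescript
// Hypothetical, simplified registry: NOT the implementation behind createGraderRegistry().
interface NamedGrader { metadata: { name: string } }

class SketchRegistry {
  private readonly graders = new Map<string, NamedGrader>();

  // Store a grader under its metadata.name, rejecting duplicates.
  register(grader: NamedGrader): void {
    const { name } = grader.metadata;
    if (this.graders.has(name)) {
      throw new Error(`Grader "${name}" is already registered`);
    }
    this.graders.set(name, grader);
  }

  // Resolve the `type` field of an eval.yaml grader entry to a registered grader.
  resolve(type: string): NamedGrader {
    const grader = this.graders.get(type);
    if (!grader) {
      throw new Error(`Unknown grader type "${type}"`);
    }
    return grader;
  }
}

const sketchRegistry = new SketchRegistry();
sketchRegistry.register({ metadata: { name: "no-errors" } });
const resolvedGrader = sketchRegistry.resolve("no-errors");
```

Under this model, a typo in either `metadata.name` or the `type` field surfaces as an unknown-grader error rather than a silently skipped check.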

Graders can also read options from the eval spec. The following example checks that the agent's tool-call count falls within a configurable range:

tool-count-grader.ts
import type { Grader, GraderMetadata, GraderInput, GraderResult } from "@microsoft/vally";

interface Config {
  min?: number;
  max?: number;
}

export class ToolCountGrader implements Grader {
  metadata: GraderMetadata = {
    name: "tool-count",
    description: "Checks that tool call count is within expected range",
    behavior: { execution: "single" },
    determinism: "static",
    costProfile: "free",
    portability: "t1-universal",
    reference: "reference-free",
    temporalScope: "trajectory-level",
  };

  async grade(input: GraderInput): Promise<GraderResult> {
    if (!input.trajectory) throw new Error("Missing trajectory");
    // Both bounds are optional: min defaults to 0, max to unbounded.
    const config = (input.config ?? {}) as Config;
    const count = input.trajectory.metrics.toolCallCount;
    const min = config.min ?? 0;
    const max = config.max ?? Infinity;
    const passed = count >= min && count <= max;
    return {
      name: this.metadata.name,
      kind: "code",
      passed,
      score: passed ? 1 : 0,
      // Separate the bounds so that min=1, max=10 reads "1-10" rather than "110".
      evidence: `${count} tool calls (expected ${min}-${max === Infinity ? "∞" : max})`,
      label: passed ? "correct" : "incorrect",
    };
  }
}

Use in eval.yaml:

graders:
  - type: tool-count
    config:
      min: 1
      max: 10
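Because `input.config` comes straight from the eval spec, a mistake such as `min: 10` with `max: 1` would make the range check above fail on every run without explaining why. One possible guard, sketched here with a hypothetical `parseRange` helper that is not part of Vally, validates the values before they are used.

```typescript
interface RangeConfig { min?: number; max?: number }

// Hypothetical helper: applies defaults to the grader's config block and validates it.
function parseRange(raw: unknown): { min: number; max: number } {
  const config = (raw ?? {}) as RangeConfig;
  const min = config.min ?? 0;
  const max = config.max ?? Infinity;
  // YAML values are untyped at runtime, so check them explicitly.
  if (typeof min !== "number" || typeof max !== "number" || Number.isNaN(min) || Number.isNaN(max)) {
    throw new Error("tool-count: min and max must be numbers");
  }
  if (min < 0) throw new Error("tool-count: min must not be negative");
  if (min > max) throw new Error(`tool-count: min (${min}) exceeds max (${max})`);
  return { min, max };
}
```

Calling a helper like this at the top of `grade` turns a misconfigured spec into an immediate, descriptive error instead of a misleading failed grade.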

Choose taxonomy values honestly — they’re surfaced in reports and help eval authors decide whether to include your grader in fast inner-loop runs or reserve it for outer-loop evaluation:

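| If your grader…                   | Set determinism to… | Set cost to… |
|-----------------------------------|---------------------|--------------|
| Does string/file operations only  | static              | free or low  |
| Runs a subprocess or does I/O     | complex-static      | low          |
| Calls an embedding/small model    | slm                 | medium       |
| Calls GPT-5.5 or similar          | llm                 | high         |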
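To make the inner-loop versus outer-loop distinction concrete, the sketch below filters a list of graders by their declared taxonomy. The `isInnerLoopFriendly` rule and its thresholds are illustrative assumptions, not a policy that Vally enforces, and `llm-judge` is a made-up grader name.

```typescript
type Determinism = "static" | "complex-static" | "slm" | "llm";
type CostProfile = "free" | "low" | "medium" | "high";

interface TaxonomyInfo { name: string; determinism: Determinism; costProfile: CostProfile }

// Assumed rule of thumb: inner-loop runs keep only deterministic, cheap graders.
function isInnerLoopFriendly(m: TaxonomyInfo): boolean {
  const deterministic = m.determinism === "static" || m.determinism === "complex-static";
  const cheap = m.costProfile === "free" || m.costProfile === "low";
  return deterministic && cheap;
}

const candidates: TaxonomyInfo[] = [
  { name: "no-errors", determinism: "static", costProfile: "free" },
  { name: "tool-count", determinism: "static", costProfile: "free" },
  { name: "llm-judge", determinism: "llm", costProfile: "high" }, // hypothetical grader
];

// Names of the graders that would run in a fast inner loop under the assumed rule.
const innerLoop = candidates.filter(isInnerLoopFriendly).map((m) => m.name);
```

A grader that understates its cost or determinism would slip through a filter like this, which is why accurate metadata matters.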

Write tests that exercise both passing and failing cases:

no-errors-grader.test.ts
import { describe, it, expect } from "vitest";
import { NoErrorsGrader } from "./no-errors-grader.js";

describe("NoErrorsGrader", () => {
  const grader = new NoErrorsGrader();

  it("passes when no errors", async () => {
    const result = await grader.grade({
      trajectory: {
        id: "test-1",
        events: [
          {
            type: "tool_call",
            timestamp: new Date(),
            data: { toolName: "read_file", toolCallId: "1" },
          },
        ],
        metrics: { errorCount: 0 },
        output: "done",
        workDir: "/tmp",
      },
    });
    expect(result.passed).toBe(true);
    expect(result.score).toBe(1);
  });

  it("fails when errors exist", async () => {
    const result = await grader.grade({
      trajectory: {
        id: "test-2",
        events: [{ type: "error", timestamp: new Date(), data: { message: "timeout" } }],
        metrics: { errorCount: 1 },
        output: "",
        workDir: "/tmp",
      },
    });
    expect(result.passed).toBe(false);
    expect(result.evidence).toContain("timeout");
  });
});
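Both tests above build a trajectory by hand. As a suite grows, a small fixture factory can cut that repetition. The `makeTrajectory` helper below is hypothetical, and its local `FixtureEvent` and `FixtureTrajectory` types only mirror the fields used in the tests above rather than Vally's full types.

```typescript
interface FixtureEvent { type: string; timestamp: Date; data: Record<string, unknown> }
interface FixtureTrajectory {
  id: string;
  events: FixtureEvent[];
  metrics: { errorCount: number };
  output: string;
  workDir: string;
}

let nextFixtureId = 0;

// Hypothetical test helper: builds a trajectory from [type, data] pairs and
// derives metrics.errorCount from the events so the two never disagree.
function makeTrajectory(
  events: Array<[string, Record<string, unknown>]>,
  output = "",
): FixtureTrajectory {
  const built = events.map(([type, data]) => ({ type, timestamp: new Date(), data }));
  return {
    id: `test-${++nextFixtureId}`,
    events: built,
    metrics: { errorCount: built.filter((e) => e.type === "error").length },
    output,
    workDir: "/tmp",
  };
}

const failingFixture = makeTrajectory([["error", { message: "timeout" }]]);
```

A test could then pass `makeTrajectory([...])` as the `trajectory` field, provided the fixture shape is compatible with the real `GraderInput` type.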

Custom graders can be shipped in a separate npm package and loaded at runtime via --grader-plugin. This lets teams share graders across repos without forking vally.

  1. Create the package

    Your package exports a registerGraders function that receives the grader registry:

    src/index.ts
    import type { GraderRegistry } from "@microsoft/vally";
    import { NoErrorsGrader } from "./no-errors-grader.js";

    export function registerGraders(registry: GraderRegistry): void {
      registry.register(new NoErrorsGrader());
    }

    Set @microsoft/vally as a peer dependency in your package.json:

    package.json
    {
      "name": "@myorg/vally-grader-quality",
      "main": "dist/index.js",
      "peerDependencies": {
        "@microsoft/vally": "^0.2.0"
      }
    }
  2. Use it from the CLI

    Pass the package name or path to any vally command:

    # npm package
    vally eval --grader-plugin @myorg/vally-grader-quality --eval-spec eval.yaml
    # local path
    vally eval --grader-plugin ./my-graders --eval-spec eval.yaml
    # works with lint, grade, compare, and export too
    vally lint --eval eval.yaml --grader-plugin ./my-graders
  3. Reference plugin graders in eval.yaml

    Plugin graders are referenced by name, just like built-ins:

    eval.yaml
    graders:
      - type: no-errors
      - type: output-contains
        config:
          substring: "hello"
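Conceptually, loading a plugin amounts to importing the module and handing the registry to its `registerGraders` export. The sketch below illustrates that contract with a hypothetical `applyPlugin` function. How Vally actually resolves and loads `--grader-plugin` values is not described in this guide, so treat the details as assumptions.

```typescript
interface MinimalRegistry { register(grader: { metadata: { name: string } }): void }

// Hypothetical loader step: validate the module's export, then let it register graders.
function applyPlugin(mod: unknown, registry: MinimalRegistry, source: string): void {
  const register = (mod as { registerGraders?: unknown } | null)?.registerGraders;
  if (typeof register !== "function") {
    throw new Error(`${source} does not export a registerGraders function`);
  }
  register(registry);
}

// Stand-in for a module that a real loader would obtain via `await import(source)`.
const fakePlugin = {
  registerGraders(registry: MinimalRegistry): void {
    registry.register({ metadata: { name: "no-errors" } });
  },
};

// Record the names that the plugin registers.
const registeredNames: string[] = [];
applyPlugin(
  fakePlugin,
  { register: (g) => { registeredNames.push(g.metadata.name); } },
  "./my-graders",
);
```

Checking the export up front gives plugin authors a clear message when `registerGraders` is missing or misspelled.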