Writing Custom Graders
Vally ships with built-in graders, but real-world evals often need domain-specific checks. This guide walks through building a custom grader from scratch.
The Grader interface
Section titled “The Grader interface”Every grader implements this interface:
interface Grader { metadata: GraderMetadata; grade(input: GraderInput): Promise<GraderResult>;}
interface GraderMetadata { name: string; description: string; behavior: GraderBehavior; determinism: "static" | "complex-static" | "slm" | "llm"; portability: "t1-universal" | "t2-domain" | "t3a-scenario" | "t3b-system"; reference: "reference-free" | "reference-based"; temporalScope: "point-in-time" | "trajectory-level" | "cross-trajectory"; costProfile: "free" | "low" | "medium" | "high";}
interface GraderBehavior { execution: "single" | "comparative"; requiresLlmClient?: boolean;}The grade method receives a GraderInput containing:
trajectory— the full event log from the agent runstimulus— the prompt and config that produced itconfig— grader-specific config from the eval spectrajectoryB(optional) — for pairwise A/B comparisons
Example: a “no-errors” grader
Section titled “Example: a “no-errors” grader”Let’s build a grader that checks whether the agent produced any errors during its run.
-
Define the grader class
no-errors-grader.ts import type { Grader, GraderMetadata, GraderInput, GraderResult } from "@microsoft/vally";export class NoErrorsGrader implements Grader {metadata: GraderMetadata = {name: "no-errors",description: "Checks that the agent produced no error events",behavior: { execution: "single" },determinism: "static",costProfile: "free",portability: "t1-universal",reference: "reference-free",temporalScope: "trajectory-level",};async grade(input: GraderInput): Promise<GraderResult> {if (!input.trajectory) {throw new Error("Missing trajectory");}const errors = input.trajectory.events.filter((e) => e.type === "error");const passed = errors.length === 0;return {name: this.metadata.name,kind: "code",passed,score: passed ? 1 : 0,evidence: passed? "No error events in trajectory": `${errors.length} error(s): ${errors.map((e) => e.data.message).join(", ")}`,label: passed ? "correct" : "incorrect",};}} -
Register it
register.ts import { createGraderRegistry } from "@microsoft/vally";import { NoErrorsGrader } from "./no-errors-grader.js";const registry = createGraderRegistry();registry.register(new NoErrorsGrader()); -
Use it in eval.yaml
eval.yaml stimuli:- name: test-caseprompt: "Do something"graders:- type: no-errors
Example: a tool-count grader
Section titled “Example: a tool-count grader”A grader that checks the agent used a reasonable number of tool calls:
import type { Grader, GraderMetadata, GraderInput, GraderResult } from "@microsoft/vally";
interface Config { min?: number; max?: number;}
export class ToolCountGrader implements Grader { metadata: GraderMetadata = { name: "tool-count", description: "Checks that tool call count is within expected range", behavior: { execution: "single" }, determinism: "static", costProfile: "free", portability: "t1-universal", reference: "reference-free", temporalScope: "trajectory-level", };
async grade(input: GraderInput): Promise<GraderResult> { if (!input.trajectory) throw new Error("Missing trajectory");
const config = (input.config ?? {}) as Config; const count = input.trajectory.metrics.toolCallCount; const min = config.min ?? 0; const max = config.max ?? Infinity; const passed = count >= min && count <= max;
return { name: this.metadata.name, kind: "code", passed, score: passed ? 1 : 0, evidence: `${count} tool calls (expected ${min}–${max === Infinity ? "∞" : max})`, label: passed ? "correct" : "incorrect", }; }}Use in eval.yaml:
graders: - type: tool-count config: min: 1 max: 10Taxonomy guidelines
Section titled “Taxonomy guidelines”Choose taxonomy values honestly — they’re surfaced in reports and help eval authors decide whether to include your grader in fast inner-loop runs or reserve it for outer-loop evaluation:
| If your grader… | Set determinism to… | Set cost to… |
|---|---|---|
| Does string/file operations only | static | free or low |
| Runs a subprocess or does I/O | complex-static | low |
| Calls an embedding/small model | slm | medium |
| Calls GPT-5.5 or similar | llm | high |
Testing your grader
Section titled “Testing your grader”Write tests that exercise both passing and failing cases:
import { describe, it, expect } from "vitest";import { NoErrorsGrader } from "./no-errors-grader.js";
describe("NoErrorsGrader", () => { const grader = new NoErrorsGrader();
it("passes when no errors", async () => { const result = await grader.grade({ trajectory: { id: "test-1", events: [ { type: "tool_call", timestamp: new Date(), data: { toolName: "read_file", toolCallId: "1" }, }, ], metrics: { errorCount: 0 }, output: "done", workDir: "/tmp", }, }); expect(result.passed).toBe(true); expect(result.score).toBe(1); });
it("fails when errors exist", async () => { const result = await grader.grade({ trajectory: { id: "test-2", events: [{ type: "error", timestamp: new Date(), data: { message: "timeout" } }], metrics: { errorCount: 1 }, output: "", workDir: "/tmp", }, }); expect(result.passed).toBe(false); expect(result.evidence).toContain("timeout"); });});Shipping as a plugin package
Section titled “Shipping as a plugin package”Custom graders can be shipped in a separate npm package and loaded at runtime via --grader-plugin. This lets teams share graders across repos without forking vally.
-
Create the package
Your package exports a
registerGradersfunction that receives the grader registry:src/index.ts import type { GraderRegistry } from "@microsoft/vally";import { NoErrorsGrader } from "./no-errors-grader.js";export function registerGraders(registry: GraderRegistry): void {registry.register(new NoErrorsGrader());}Set
@microsoft/vallyas a peer dependency in yourpackage.json:package.json {"name": "@myorg/vally-grader-quality","main": "dist/index.js","peerDependencies": {"@microsoft/vally": "^0.2.0"}} -
Use it from the CLI
Pass the package name or path to any vally command:
Terminal window # npm packagevally eval --grader-plugin @myorg/vally-grader-quality --eval-spec eval.yaml# local pathvally eval --grader-plugin ./my-graders --eval-spec eval.yaml# works with lint, grade, compare, and export toovally lint --eval eval.yaml --grader-plugin ./my-graders -
Reference plugin graders in eval.yaml
Plugin graders are referenced by name, just like built-ins:
eval.yaml graders:- type: no-errors- type: output-containsconfig:substring: "hello"
Next steps
Section titled “Next steps”- Grader taxonomy — deep dive on taxonomy dimensions
- Grader catalog — built-in grader examples
- How it works — where graders fit in the pipeline