Writing Custom Graders

Vally ships with built-in graders, but real-world evals often need domain-specific checks. This guide walks through building a custom grader from scratch.

The Grader interface

Every grader implements this interface:

interface Grader {
  metadata: GraderMetadata;
  grade(input: GraderInput): Promise<GraderResult>;
  // Optional: implement to support head-to-head comparison (vally compare).
  compare?(input: GraderComparisonInput): Promise<GraderComparisonResult>;
}

interface GraderMetadata {
  name: string;
  description: string;
  behavior: GraderBehavior;
  determinism: "static" | "complex-static" | "slm" | "llm";
  reference: "reference-free" | "reference-based";
  temporalScope: "point-in-time" | "trajectory-level" | "cross-trajectory";
  costProfile: "free" | "low" | "medium" | "high";
}

interface GraderBehavior {
  requiresLlmClient?: boolean;
  requiresWorkspace?: boolean;
}

The grade method receives a GraderInput containing:

trajectory — the full event log from the agent run
stimulus — the prompt and config that produced it
config — grader-specific config from the eval spec
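As a rough sketch of how these three fields might be consumed, the snippet below declares a minimal local stand-in for GraderInput and reads each field defensively. The SketchInput type and the summarizeInput helper are hypothetical, and the field shapes are inferred from the examples in this guide; the real types exported by @microsoft/vally may differ.

```typescript
// Hypothetical stand-in types, inferred from this guide's examples.
// The real GraderInput exported by @microsoft/vally may have more fields.
interface SketchEvent {
  type: string;
  timestamp: Date;
  data: Record<string, unknown>;
}

interface SketchInput {
  trajectory?: { id: string; events: SketchEvent[] };
  stimulus?: { name: string; prompt: string };
  config?: Record<string, unknown>;
}

// Summarize what a grader has to work with, tolerating missing fields.
function summarizeInput(input: SketchInput): string {
  const eventCount = input.trajectory?.events.length ?? 0;
  const prompt = input.stimulus?.prompt ?? "(no stimulus)";
  const configKeys = Object.keys(input.config ?? {});
  return `${eventCount} event(s), prompt="${prompt}", config keys=[${configKeys.join(", ")}]`;
}

const summary = summarizeInput({
  trajectory: { id: "t-1", events: [{ type: "tool_call", timestamp: new Date(), data: {} }] },
  stimulus: { name: "test-case", prompt: "Do something" },
  config: { max: 10 },
});
console.log(summary);
```

Treating every field as possibly absent mirrors the `if (!input.trajectory)` guard used by the graders later in this guide.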

Comparison capability (optional)

A grader supports head-to-head comparison by implementing the optional compare() method; its presence is what marks the grader as comparison-capable. compare() receives a GraderComparisonInput (a baseline and a treatment trajectory for the same stimulus) and returns a signed, treatment-relative GraderComparisonResult. It is invoked by vally compare. The built-in prompt grader implements it.
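To make "signed, treatment-relative" concrete, here is a minimal sketch of comparison logic based on error counts. Everything in it is an assumption for illustration: the `baseline` and `treatment` field names, the `score` and `evidence` result fields, and the convention that a positive score favors the treatment. Check the actual GraderComparisonInput and GraderComparisonResult types before relying on any of it.

```typescript
// Hypothetical shapes: the real GraderComparisonInput and
// GraderComparisonResult in @microsoft/vally may use different field names.
interface SketchTrajectory {
  events: { type: string }[];
}

interface SketchComparisonInput {
  baseline: SketchTrajectory;
  treatment: SketchTrajectory;
}

interface SketchComparisonResult {
  score: number; // assumed convention: > 0 favors treatment, < 0 favors baseline
  evidence: string;
}

function countErrors(trajectory: SketchTrajectory): number {
  return trajectory.events.filter((e) => e.type === "error").length;
}

// Compare two runs of the same stimulus by their error counts.
function compareErrorCounts(input: SketchComparisonInput): SketchComparisonResult {
  const baselineErrors = countErrors(input.baseline);
  const treatmentErrors = countErrors(input.treatment);
  // Math.sign yields 1, 0, or -1; fewer treatment errors gives a positive score.
  const score = Math.sign(baselineErrors - treatmentErrors);
  return {
    score,
    evidence: `baseline: ${baselineErrors} error(s), treatment: ${treatmentErrors} error(s)`,
  };
}

const comparison = compareErrorCounts({
  baseline: { events: [{ type: "error" }, { type: "tool_call" }] },
  treatment: { events: [{ type: "tool_call" }] },
});
console.log(comparison.score, comparison.evidence);
```

In a real grader this logic would live inside an async compare() method on the class, next to grade().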

Example: a “no-errors” grader

Let’s build a grader that checks whether the agent produced any errors during its run.

Define the grader class

import type { Grader, GraderMetadata, GraderInput, GraderResult } from "@microsoft/vally";

export class NoErrorsGrader implements Grader {
  metadata: GraderMetadata = {
    name: "no-errors",
    description: "Checks that the agent produced no error events",
    behavior: {},
    determinism: "static",
    costProfile: "free",
    reference: "reference-free",
    temporalScope: "trajectory-level",
  };

  async grade(input: GraderInput): Promise<GraderResult> {
    if (!input.trajectory) {
      throw new Error("Missing trajectory");
    }

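    // Collect every event in the trajectory whose type is "error".
    const errors = input.trajectory.events.filter((e) => e.type === "error");
    // The grader passes only when the run produced no error events at all.
    const passed = errors.length === 0;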

    return {
      name: this.metadata.name,
      kind: "code",
      passed,
      score: passed ? 1 : 0,
      evidence: passed
        ? "No error events in trajectory"
        : `${errors.length} error(s): ${errors.map((e) => e.data.message).join(", ")}`,
      label: passed ? "correct" : "incorrect",
    };
  }
}

Register it

import { createGraderRegistry } from "@microsoft/vally";
import { NoErrorsGrader } from "./no-errors-grader.js";

const registry = createGraderRegistry();
registry.register(new NoErrorsGrader());

Use it in eval.yaml

stimuli:
  - name: test-case
    prompt: "Do something"
    graders:
      - type: no-errors

Example: a tool-count grader

This grader checks that the agent's tool call count falls within a configurable range:

import type { Grader, GraderMetadata, GraderInput, GraderResult } from "@microsoft/vally";

interface Config {
  min?: number;
  max?: number;
}

export class ToolCountGrader implements Grader {
  metadata: GraderMetadata = {
    name: "tool-count",
    description: "Checks that tool call count is within expected range",
    behavior: {},
    determinism: "static",
    costProfile: "free",
    reference: "reference-free",
    temporalScope: "trajectory-level",
  };

  async grade(input: GraderInput): Promise<GraderResult> {
    if (!input.trajectory) throw new Error("Missing trajectory");

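    // Config comes from the eval spec; both bounds are optional.
    const config = (input.config ?? {}) as Config;
    const count = input.trajectory.metrics.toolCallCount;
    // Default to an unbounded range so an omitted bound never fails the check.
    const min = config.min ?? 0;
    const max = config.max ?? Infinity;
    const passed = count >= min && count <= max;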

    return {
      name: this.metadata.name,
      kind: "code",
      passed,
      score: passed ? 1 : 0,
      evidence: `${count} tool calls (expected ${min}–${max === Infinity ? "∞" : max})`,
      label: passed ? "correct" : "incorrect",
    };
  }
}

Use in eval.yaml:

graders:
  - type: tool-count
    config:
      min: 1
      max: 10
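The `as Config` cast in ToolCountGrader is unchecked, so a mistake in eval.yaml such as `max: "ten"` would reach the comparison as-is and fail every run without explaining why. One possible guard is a small validation helper like the sketch below. parseRange is a hypothetical name, not part of Vally, and the error wording is only a suggestion.

```typescript
// Hypothetical helper: validates the untyped grader config before use.
interface Range {
  min: number;
  max: number;
}

function parseRange(config: unknown): Range {
  const raw = (config ?? {}) as Record<string, unknown>;
  // Missing (undefined or null) bounds fall back to an unbounded range.
  const min = raw.min ?? 0;
  const max = raw.max ?? Infinity;
  if (typeof min !== "number" || Number.isNaN(min)) {
    throw new Error(`tool-count: "min" must be a number, got ${JSON.stringify(raw.min)}`);
  }
  if (typeof max !== "number" || Number.isNaN(max)) {
    throw new Error(`tool-count: "max" must be a number, got ${JSON.stringify(raw.max)}`);
  }
  if (min > max) {
    throw new Error(`tool-count: "min" (${min}) must not exceed "max" (${max})`);
  }
  return { min, max };
}

const range = parseRange({ min: 1, max: 10 });
console.log(range.min, range.max);
```

Calling a helper like this at the top of grade() turns a misconfigured spec into an explicit error instead of a confusing failed grade.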

Taxonomy guidelines

Choose taxonomy values honestly — they’re surfaced in reports and help eval authors decide whether to include your grader in fast inner-loop runs or reserve it for outer-loop evaluation:

If your grader…	Set determinism to…	Set costProfile to…
Does string/file operations only	`static`	`free` or `low`
Runs a subprocess or does I/O	`complex-static`	`low`
Calls an embedding/small model	`slm`	`medium`
Calls GPT-5.5 or similar	`llm`	`high`
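As an illustration of the table, the sketch below fills in metadata for two hypothetical graders: one that runs a linter as a subprocess and one that asks an LLM to score output against a rubric. The grader names, the interpretation of the behavior flags, and the isInnerLoopCandidate helper are all assumptions made for this example; Vally may select graders for inner-loop runs differently.

```typescript
// Local copy of the metadata shape shown earlier in this guide so the sketch
// compiles on its own; a real grader would import it from "@microsoft/vally".
interface GraderMetadata {
  name: string;
  description: string;
  behavior: { requiresLlmClient?: boolean; requiresWorkspace?: boolean };
  determinism: "static" | "complex-static" | "slm" | "llm";
  reference: "reference-free" | "reference-based";
  temporalScope: "point-in-time" | "trajectory-level" | "cross-trajectory";
  costProfile: "free" | "low" | "medium" | "high";
}

// Hypothetical grader that shells out to a linter.
const lintMetadata: GraderMetadata = {
  name: "lint-clean",
  description: "Runs a linter over the files the agent produced",
  behavior: { requiresWorkspace: true }, // assumed: the grader reads files from the run's workspace
  determinism: "complex-static",
  reference: "reference-free",
  temporalScope: "point-in-time",
  costProfile: "low",
};

// Hypothetical grader that calls a large language model.
const rubricMetadata: GraderMetadata = {
  name: "rubric-judge",
  description: "Asks an LLM to score the output against a rubric",
  behavior: { requiresLlmClient: true },
  determinism: "llm",
  reference: "reference-based",
  temporalScope: "point-in-time",
  costProfile: "high",
};

// Hypothetical filter: treat free and low-cost graders as inner-loop candidates.
function isInnerLoopCandidate(metadata: GraderMetadata): boolean {
  return metadata.costProfile === "free" || metadata.costProfile === "low";
}

console.log(isInnerLoopCandidate(lintMetadata), isInnerLoopCandidate(rubricMetadata));
```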

Testing your grader

Write tests that exercise both passing and failing cases:

import { describe, it, expect } from "vitest";
import { NoErrorsGrader } from "./no-errors-grader.js";

describe("NoErrorsGrader", () => {
  const grader = new NoErrorsGrader();

  it("passes when no errors", async () => {
    const result = await grader.grade({
      trajectory: {
        id: "test-1",
        events: [
          {
            type: "tool_call",
            timestamp: new Date(),
            data: { toolName: "read_file", toolCallId: "1" },
          },
        ],
        metrics: { errorCount: 0 },
        output: "done",
        workDir: "/tmp",
      },
    });
    expect(result.passed).toBe(true);
    expect(result.score).toBe(1);
  });

  it("fails when errors exist", async () => {
    const result = await grader.grade({
      trajectory: {
        id: "test-2",
        events: [{ type: "error", timestamp: new Date(), data: { message: "timeout" } }],
        metrics: { errorCount: 1 },
        output: "",
        workDir: "/tmp",
      },
    });
    expect(result.passed).toBe(false);
    expect(result.evidence).toContain("timeout");
  });
});
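As a suite grows, the repeated trajectory literals get noisy. One option is a small factory with sensible defaults, sketched below. makeTrajectory, errorEvent, and the TestTrajectory shape are hypothetical and mirror only the fields used in the tests above; if the real trajectory type requires more fields, extend the defaults accordingly.

```typescript
// Hypothetical test helpers; field shapes mirror the trajectories used in the tests above.
interface TestEvent {
  type: string;
  timestamp: Date;
  data: Record<string, unknown>;
}

interface TestTrajectory {
  id: string;
  events: TestEvent[];
  metrics: Record<string, number>;
  output: string;
  workDir: string;
}

// Build a trajectory with defaults, overriding only what a test cares about.
function makeTrajectory(overrides: Partial<TestTrajectory> = {}): TestTrajectory {
  return {
    id: "test",
    events: [],
    metrics: { errorCount: 0 },
    output: "",
    workDir: "/tmp",
    ...overrides,
  };
}

// Build a single error event carrying the given message.
function errorEvent(message: string): TestEvent {
  return { type: "error", timestamp: new Date(), data: { message } };
}

const failing = makeTrajectory({
  id: "test-2",
  events: [errorEvent("timeout")],
  metrics: { errorCount: 1 },
});
console.log(failing.id, failing.events.length, failing.workDir);
```

A test body then shrinks to something like `grader.grade({ trajectory: makeTrajectory({ events: [errorEvent("timeout")] }) })`, possibly with a cast if the real types are stricter than this sketch.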

Shipping as a plugin package

Custom graders can be shipped in a separate npm package and loaded at runtime via --grader-plugin. This lets teams share graders across repos without forking Vally.

Create the package

Your package exports a registerGraders function that receives the grader registry:

import type { GraderRegistry } from "@microsoft/vally";
import { NoErrorsGrader } from "./no-errors-grader.js";

export function registerGraders(registry: GraderRegistry): void {
  registry.register(new NoErrorsGrader());
}
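Before publishing, it can help to check that the entry point registers what you expect. The sketch below does this with a recording stub in place of the real registry. RecordingRegistry and the stand-in types are hypothetical, and the only thing assumed about the real GraderRegistry is the register() method shown above. The second grader name is included purely to show multiple registrations.

```typescript
// Minimal stand-ins so the sketch runs without @microsoft/vally installed.
interface NamedGrader {
  metadata: { name: string };
}

interface RegistryLike {
  register(grader: NamedGrader): void;
}

// Hypothetical recording stub: captures the name of every registered grader
// and rejects duplicates, which would otherwise be ambiguous in eval.yaml.
class RecordingRegistry implements RegistryLike {
  readonly names: string[] = [];

  register(grader: NamedGrader): void {
    const name = grader.metadata.name;
    if (this.names.includes(name)) {
      throw new Error(`duplicate grader name: ${name}`);
    }
    this.names.push(name);
  }
}

// Stand-in for the plugin entry point; a real one would register grader instances.
function registerGraders(registry: RegistryLike): void {
  registry.register({ metadata: { name: "no-errors" } });
  registry.register({ metadata: { name: "tool-count" } });
}

const recording = new RecordingRegistry();
registerGraders(recording);
console.log(recording.names.join(","));
```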

Set @microsoft/vally as a peer dependency in your package.json:

{
  "name": "@myorg/vally-grader-quality",
  "main": "dist/index.js",
  "peerDependencies": {
    "@microsoft/vally": "^0.2.0"
  }
}

Use it from the CLI

Pass the package name or path to any vally command:

# npm package
vally eval --grader-plugin @myorg/vally-grader-quality --eval-spec eval.yaml

# local path
vally eval --grader-plugin ./my-graders --eval-spec eval.yaml

# works with lint, grade, compare, and export too
vally lint --eval-spec eval.yaml --grader-plugin ./my-graders

Reference plugin graders in eval.yaml

Plugin graders are referenced by name, just like built-ins:

graders:
  - type: no-errors
  - type: output-contains
    config:
      substring: "hello"

Next steps

Grader taxonomy — deep dive on taxonomy dimensions
Grader catalog — built-in grader examples
How it works — where graders fit in the pipeline