Grader Taxonomy

Every grader in Vally declares a set of taxonomy metadata — structured properties that describe what the grader does, how much it costs, and where it can run. You use this metadata to decide which graders to include in your eval.yaml for each context (inner loop, CI, outer loop).

Grader metadata separates two concerns:

  • Behavior (metadata.behavior) — how the pipeline treats the grader at runtime. This has two fields:

    • execution: "single" (grades one trajectory) or "comparative" (compares two, like pairwise A/B)
    • requiresLlmClient: whether the grader needs an LLM client to be provisioned
  • Taxonomy (the remaining fields) — descriptive classification shown in reports and used when authoring evals to pick the right graders for a given loop.

This separation ensures that adding a new grader to a plugin package doesn’t require understanding pipeline internals — you just set behavior and the system handles the rest.
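To make the split concrete, here is a minimal sketch of how a pipeline could schedule graders by reading the behavior fields alone. Only `execution` and `requiresLlmClient` come from the description above; `SketchGrader`, `RunPlan`, `planRun`, and the grader name `pairwise-judge` are hypothetical and are not part of Vally's API.

```typescript
// Hypothetical shapes for illustration. Only `execution` and
// `requiresLlmClient` are taken from the behavior description above.
type Execution = "single" | "comparative";

interface GraderBehavior {
  execution: Execution;
  requiresLlmClient?: boolean;
}

interface SketchGrader {
  name: string;
  behavior: GraderBehavior;
}

interface RunPlan {
  single: string[]; // graders that grade one trajectory
  comparative: string[]; // graders that compare two trajectories
  needsLlmClient: boolean; // true if any grader asks for an LLM client
}

// Build a run plan from behavior only; taxonomy fields are never consulted.
function planRun(graders: SketchGrader[]): RunPlan {
  const plan: RunPlan = { single: [], comparative: [], needsLlmClient: false };
  for (const g of graders) {
    plan[g.behavior.execution].push(g.name);
    if (g.behavior.requiresLlmClient === true) {
      plan.needsLlmClient = true;
    }
  }
  return plan;
}

const examplePlan = planRun([
  { name: "file-exists", behavior: { execution: "single" } },
  { name: "pairwise-judge", behavior: { execution: "comparative", requiresLlmClient: true } },
]);
console.log(examplePlan);
```

Because the plan depends only on `behavior`, a grader author can change any taxonomy field without affecting how the grader is scheduled in this sketch.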

Every grader declares these six properties in its metadata:

determinism: How reproducible is this grader’s output?

Value | Meaning | Example
static | Always produces the same result for the same input. No randomness, no external calls. | output-contains, file-exists
complex-static | Deterministic but computationally heavier (regex, AST parsing, command execution). | run-command, output-matches
slm | Uses a small language model. Mostly reproducible but not guaranteed. | Embedding similarity checker
llm | Uses a large language model. Non-deterministic, requires multiple trials for confidence. | LLM judge, pairwise comparison

costProfile: How expensive is this grader to run?

Value | Meaning | Typical use
free | No compute cost beyond basic CPU. | String matching, file checks
low | Minimal cost: local computation, small I/O. | Command execution, glob matching
medium | Moderate cost: SLM inference, API calls. | Embedding models
high | Significant cost: LLM inference, multi-turn evaluation. | GPT-5.5 judges, A/B comparisons

portability: How broadly applicable is this grader?

Value | Meaning | Example
t1-universal | Works for any skill in any domain. | output-contains, file-exists
t2-domain | Works within a specific domain (e.g., code generation, data analysis). | Code compilation checker
t3a-scenario | Specific to a particular evaluation scenario. | Custom business logic check
t3b-system | Tied to a specific system or environment. | Checks a specific API endpoint

reference: Does the grader need a “correct answer” to compare against?

Value | Meaning
reference-free | Grades based only on the output/trajectory itself.
reference-based | Needs a gold-standard reference answer to compare against.

temporalScope: What part of the trajectory does the grader inspect?

Value | Meaning
point-in-time | Looks at a single moment (e.g., final output).
trajectory-level | Inspects the full event sequence of one run.
cross-trajectory | Compares across multiple trajectories (A/B testing).

How was the score produced?

Value | Meaning
code | Computed deterministically by code.
llm | Produced by an LLM judge.
human | Assigned by a human reviewer.

Taxonomy is descriptive metadata. When you write an eval.yaml, you explicitly list the graders you want to run; the taxonomy fields help you decide which graders are appropriate for each loop.
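As an illustration of that decision, the sketch below shortlists graders from a small catalog by determinism and cost. The value names come from the tables above; the `TaxonomyEntry` shape, the `shortlist` helper, and the catalog entries' exact classifications are assumptions made for the example, not something Vally provides or enforces.

```typescript
type Determinism = "static" | "complex-static" | "slm" | "llm";
type CostProfile = "free" | "low" | "medium" | "high";

// Hypothetical, trimmed-down view of a grader's taxonomy.
interface TaxonomyEntry {
  name: string;
  determinism: Determinism;
  costProfile: CostProfile;
}

// Cost values ordered from cheapest to most expensive.
const COST_ORDER: CostProfile[] = ["free", "low", "medium", "high"];

// Return the names of graders whose determinism is allowed and whose
// cost does not exceed the given ceiling.
function shortlist(
  catalog: TaxonomyEntry[],
  allowed: Determinism[],
  maxCost: CostProfile,
): string[] {
  const budget = COST_ORDER.indexOf(maxCost);
  return catalog
    .filter((g) => allowed.indexOf(g.determinism) !== -1)
    .filter((g) => COST_ORDER.indexOf(g.costProfile) <= budget)
    .map((g) => g.name);
}

// Illustrative catalog; classifications are assumed from the tables above.
const exampleCatalog: TaxonomyEntry[] = [
  { name: "output-contains", determinism: "static", costProfile: "free" },
  { name: "run-command", determinism: "complex-static", costProfile: "low" },
  { name: "prompt", determinism: "llm", costProfile: "high" },
];

// Candidates for a fast, deterministic loop.
console.log(shortlist(exampleCatalog, ["static", "complex-static"], "low"));
```

The result is only a list of candidates; you still write the chosen graders into eval.yaml by hand.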

For the inner loop, you just run vally lint — no eval.yaml needed. Lint runs Vally’s built-in static skill checks (spec compliance, valid refs, orphan files). Result: instant feedback, zero API calls.

For CI, include only static and complex-static graders (like output-contains, file-exists, run-command) in your eval.yaml. Set runs: 3 for basic confidence:

eval.yaml
config:
  runs: 3
stimuli:
  - name: basic-check
    prompt: "..."
    graders:
      - type: file-exists # static, low cost
        config: { path: "*.test.js" }
      - type: run-command # complex-static, low cost
        config: { command: "npm test" }

For the outer loop, add prompt graders (LLM judges) and increase runs for statistical confidence:

eval.yaml
config:
  runs: 5
  judge_model: gpt-5.5
stimuli:
  - name: quality-check
    prompt: "..."
    graders:
      - type: file-exists
        config: { path: "*.test.js" }
      - type: prompt # llm, high cost
        config:
          prompt: "Are the tests comprehensive?"
When you write a custom grader, you declare its taxonomy in the metadata property:

import type { Grader, GraderMetadata, GraderInput, GraderResult } from "@microsoft/vally";

export class MyCustomGrader implements Grader {
  metadata: GraderMetadata = {
    name: "my-custom-check",
    description: "Checks something specific to my domain",
    // Behavior: how the pipeline treats the grader at runtime
    behavior: { execution: "single" },
    // Taxonomy: descriptive classification
    determinism: "complex-static",
    costProfile: "low",
    portability: "t2-domain",
    reference: "reference-free",
    temporalScope: "trajectory-level",
  };

  async grade(input: GraderInput): Promise<GraderResult> {
    // your grading logic here; replace the placeholder below with code
    // that returns a GraderResult
    throw new Error("grade() is not implemented yet");
  }
}

Declare your grader’s taxonomy honestly. The fields are used by eval authors to decide whether to include the grader in fast inner-loop runs or reserve it for outer-loop evaluation.
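One way to keep declarations honest is a small self-check that flags combinations the tables above rule out. The sketch below is an assumption-laden illustration: `MetadataSketch` and `findInconsistencies` are hypothetical, and the two rules are inferred from the table wording ("no external calls" for static graders, "no compute cost beyond basic CPU" for free ones) rather than taken from any check Vally performs.

```typescript
// Hypothetical subset of grader metadata used only for this check.
interface MetadataSketch {
  name: string;
  behavior: { execution: "single" | "comparative"; requiresLlmClient?: boolean };
  determinism: "static" | "complex-static" | "slm" | "llm";
  costProfile: "free" | "low" | "medium" | "high";
}

// Return human-readable warnings for declarations that contradict each other.
function findInconsistencies(m: MetadataSketch): string[] {
  const issues: string[] = [];
  const deterministic =
    m.determinism === "static" || m.determinism === "complex-static";

  // A grader declared deterministic should not need an LLM client.
  if (deterministic && m.behavior.requiresLlmClient === true) {
    issues.push(`${m.name}: ${m.determinism} grader requests an LLM client`);
  }
  // LLM inference is not "free" by the cost table's definition.
  if (m.determinism === "llm" && m.costProfile === "free") {
    issues.push(`${m.name}: llm grader declares costProfile "free"`);
  }
  return issues;
}

console.log(
  findInconsistencies({
    name: "my-custom-check",
    behavior: { execution: "single", requiresLlmClient: true },
    determinism: "complex-static",
    costProfile: "low",
  }),
);
```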