Grader Taxonomy
Every grader in Vally declares a set of taxonomy metadata — structured properties that describe what the grader does, how much it costs, and where it can run. You use this metadata to decide which graders to include in your eval.yaml for each context (inner loop, CI, outer loop).
Behavior vs. taxonomy
Grader metadata separates two concerns:
- Behavior (metadata.behavior): how the pipeline treats the grader at runtime. This has two fields:
  - execution: "single" (grades one trajectory) or "comparative" (compares two, like pairwise A/B)
  - requiresLlmClient: whether the grader needs an LLM client to be provisioned
- Taxonomy (the remaining fields): descriptive classification shown in reports and used when authoring evals to pick the right graders for a given loop.
This separation ensures that adding a new grader to a plugin package doesn’t require understanding pipeline internals — you just set behavior and the system handles the rest.
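As a rough illustration of that split, the sketch below separates a metadata object into the part a pipeline would act on and the descriptive remainder. The types and both functions are hypothetical stand-ins defined locally, not Vally's actual API, and the default of `false` for a missing `requiresLlmClient` is an assumption.

```typescript
// Hypothetical local types mirroring the fields described above (not imported from Vally).
type Behavior = {
  execution: "single" | "comparative";
  requiresLlmClient?: boolean;
};

type SketchMetadata = {
  name: string;
  description: string;
  behavior: Behavior;
  determinism: string;
  costProfile: string;
  portability: string;
  reference: string;
  temporalScope: string;
};

// Split metadata into identity, runtime behavior, and descriptive taxonomy.
function splitMetadata(meta: SketchMetadata) {
  const { name, description, behavior, ...taxonomy } = meta;
  return { name, description, behavior, taxonomy };
}

// What a runner might derive from behavior alone (assumed logic, for illustration).
function planExecution(behavior: Behavior) {
  return {
    // "single" grades one trajectory; "comparative" compares two.
    trajectoriesNeeded: behavior.execution === "comparative" ? 2 : 1,
    // Assumption: an omitted flag means no LLM client is needed.
    needsLlmClient: behavior.requiresLlmClient ?? false,
  };
}
```

Note that `planExecution` never looks at the taxonomy fields, which is the point of keeping the two concerns apart.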
The taxonomy dimensions
Every grader declares these six properties in its metadata:
Determinism
How reproducible is this grader’s output?
| Value | Meaning | Example |
|---|---|---|
| static | Always produces the same result for the same input. No randomness, no external calls. | output-contains, file-exists |
| complex-static | Deterministic but computationally heavier (regex, AST parsing, command execution). | run-command, output-matches |
| slm | Uses a small language model. Mostly reproducible but not guaranteed. | Embedding similarity checker |
| llm | Uses a large language model. Non-deterministic, requires multiple trials for confidence. | LLM judge, pairwise comparison |
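To make "multiple trials for confidence" concrete, here is a minimal sketch of how pass/fail results from repeated trials could be summarized. Both helper functions are hypothetical and are not part of Vally. The standard error uses a plain binomial assumption, which is only a rough guide at small trial counts.

```typescript
// Fraction of trials that passed; `results` holds one boolean per trial.
function passRate(results: boolean[]): number {
  if (results.length === 0) throw new Error("need at least one trial");
  return results.filter(Boolean).length / results.length;
}

// Standard error of the pass rate under a simple binomial assumption.
// It shrinks with the square root of the number of trials.
function standardError(results: boolean[]): number {
  const p = passRate(results);
  return Math.sqrt((p * (1 - p)) / results.length);
}

// Four trials of a non-deterministic grader, two of which passed.
const trials = [true, false, true, false];
console.log(passRate(trials), standardError(trials)); // 0.5 0.25
```

With a pass rate of 0.5, quadrupling the trials from 4 to 16 halves the standard error from 0.25 to 0.125, which is why non-deterministic graders are usually paired with a higher run count.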
Cost profile
How expensive is this grader to run?
| Value | Meaning | Typical use |
|---|---|---|
| free | No compute cost beyond basic CPU. | String matching, file checks |
| low | Minimal cost: local computation, small I/O. | Command execution, glob matching |
| medium | Moderate cost: SLM inference, API calls. | Embedding models |
| high | Significant cost: LLM inference, multi-turn evaluation. | GPT-5.5 judges, A/B comparisons |
Portability tier
How broadly applicable is this grader?
| Value | Meaning | Example |
|---|---|---|
| t1-universal | Works for any skill in any domain. | output-contains, file-exists |
| t2-domain | Works within a specific domain (e.g., code generation, data analysis). | Code compilation checker |
| t3a-scenario | Specific to a particular evaluation scenario. | Custom business logic check |
| t3b-system | Tied to a specific system or environment. | Checks a specific API endpoint |
Reference requirement
Does the grader need a “correct answer” to compare against?
| Value | Meaning |
|---|---|
| reference-free | Grades based only on the output/trajectory itself. |
| reference-based | Needs a gold-standard reference answer to compare against. |
Temporal scope
What part of the trajectory does the grader inspect?
| Value | Meaning |
|---|---|
| point-in-time | Looks at a single moment (e.g., final output). |
| trajectory-level | Inspects the full event sequence of one run. |
| cross-trajectory | Compares across multiple trajectories (A/B testing). |
Score kind
How was the score produced?
| Value | Meaning |
|---|---|
| code | Computed deterministically by code. |
| llm | Produced by an LLM judge. |
| human | Assigned by a human reviewer. |
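Taken together, the six dimensions can be written down as string-literal unions. The sketch below is one way to model them, not Vally's own type definitions. The `scoreKind` field name is an assumption, since this page does not show how score kind is spelled in metadata, and the example judge is hypothetical.

```typescript
// Allowed values, copied from the tables on this page.
const DETERMINISM = ["static", "complex-static", "slm", "llm"] as const;
type Determinism = (typeof DETERMINISM)[number];
type CostProfile = "free" | "low" | "medium" | "high";
type Portability = "t1-universal" | "t2-domain" | "t3a-scenario" | "t3b-system";
type Reference = "reference-free" | "reference-based";
type TemporalScope = "point-in-time" | "trajectory-level" | "cross-trajectory";
type ScoreKind = "code" | "llm" | "human";

// Hypothetical shape; not imported from Vally.
interface TaxonomySketch {
  determinism: Determinism;
  costProfile: CostProfile;
  portability: Portability;
  reference: Reference;
  temporalScope: TemporalScope;
  scoreKind: ScoreKind; // assumed field name
}

// Runtime check for values that arrive as plain strings (for example from YAML).
function isDeterminism(value: string): value is Determinism {
  return (DETERMINISM as readonly string[]).includes(value);
}

// Illustrative values for an imagined LLM judge, not a built-in grader.
const exampleJudge: TaxonomySketch = {
  determinism: "llm",
  costProfile: "high",
  portability: "t1-universal",
  reference: "reference-free",
  temporalScope: "point-in-time",
  scoreKind: "llm",
};
```

Modeling the values as unions lets the compiler reject a typo such as "complex_static" before the grader ever runs.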
Using taxonomy to choose graders
Taxonomy is descriptive metadata. When you write an eval.yaml, you explicitly list the graders you want to run; the taxonomy fields help you decide which graders are appropriate for each loop.
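Because the choice is made by the eval author rather than by the tool, it can help to write the decision rule down. The sketch below is a hypothetical helper, not a Vally feature. It sends static and complex-static graders to CI and everything model-backed to the outer loop. Treating slm graders as outer-loop is an assumption, and the taxonomy values in the sample list follow the annotations used in this page's examples.

```typescript
// Hypothetical summary of a grader's declared taxonomy (not a Vally type).
interface GraderInfo {
  type: string;
  determinism: "static" | "complex-static" | "slm" | "llm";
  costProfile: "free" | "low" | "medium" | "high";
}

// Suggest the earliest loop a grader fits into.
function suggestLoop(grader: GraderInfo): "ci" | "outer" {
  const deterministic =
    grader.determinism === "static" || grader.determinism === "complex-static";
  // Assumption: any model-backed grader (slm or llm) waits for the outer loop.
  return deterministic ? "ci" : "outer";
}

const candidates: GraderInfo[] = [
  { type: "file-exists", determinism: "static", costProfile: "low" },
  { type: "run-command", determinism: "complex-static", costProfile: "low" },
  { type: "prompt", determinism: "llm", costProfile: "high" },
];

// Names of the graders you would list in a CI eval.yaml.
const ciGraders = candidates
  .filter((grader) => suggestLoop(grader) === "ci")
  .map((grader) => grader.type);
console.log(ciGraders); // ["file-exists", "run-command"]
```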
Inner loop (lint only)
For the inner loop, you just run vally lint; no eval.yaml is needed. Lint runs Vally’s built-in static skill checks (spec compliance, valid refs, orphan files). Result: instant feedback, zero API calls.
CI loop
In your eval.yaml, include only static and complex-static graders (like output-contains, file-exists, run-command). Set runs: 3 for basic confidence:

    config:
      runs: 3
    stimuli:
      - name: basic-check
        prompt: "..."
        graders:
          - type: file-exists   # static, low cost
            config: { path: "*.test.js" }
          - type: run-command   # complex-static, low cost
            config: { command: "npm test" }

Outer loop
Add prompt graders (LLM judges) and increase runs for statistical confidence:

    config:
      runs: 5
      judge_model: gpt-5.5
    stimuli:
      - name: quality-check
        prompt: "..."
        graders:
          - type: file-exists
            config: { path: "*.test.js" }
          - type: prompt   # llm, high cost
            config:
              prompt: "Are the tests comprehensive?"

Declaring taxonomy in your graders
When you write a custom grader, you declare its taxonomy in the metadata property:

    import type { Grader, GraderMetadata, GraderInput, GraderResult } from "@microsoft/vally";

    export class MyCustomGrader implements Grader {
      // Behavior tells the pipeline how to run the grader; the rest is taxonomy.
      metadata: GraderMetadata = {
        name: "my-custom-check",
        description: "Checks something specific to my domain",
        behavior: { execution: "single" },
        determinism: "complex-static",
        costProfile: "low",
        portability: "t2-domain",
        reference: "reference-free",
        temporalScope: "trajectory-level",
      };

      async grade(input: GraderInput): Promise<GraderResult> {
        // your grading logic here
        throw new Error("Not implemented");
      }
    }

Declare your grader’s taxonomy honestly. Eval authors use these fields to decide whether to include the grader in fast CI runs or reserve it for outer-loop evaluation.
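One way to keep declarations honest is to sanity-check them in a unit test. The sketch below is a hypothetical helper with heuristic rules inferred from the tables on this page; it is not a check that Vally itself performs, and the `DeclaredMetadata` type is a local stand-in rather than the library's `GraderMetadata`.

```typescript
// Local stand-in for the subset of metadata the checks need (not a Vally type).
interface DeclaredMetadata {
  name: string;
  behavior: { execution: "single" | "comparative"; requiresLlmClient?: boolean };
  determinism: "static" | "complex-static" | "slm" | "llm";
  costProfile: "free" | "low" | "medium" | "high";
  temporalScope: "point-in-time" | "trajectory-level" | "cross-trajectory";
}

// Return human-readable warnings for combinations that look implausible.
function findInconsistencies(meta: DeclaredMetadata): string[] {
  const problems: string[] = [];
  const usesModel = meta.determinism === "slm" || meta.determinism === "llm";

  // Heuristic: model inference is listed under medium or high cost.
  if (usesModel && (meta.costProfile === "free" || meta.costProfile === "low")) {
    problems.push(`${meta.name}: model-backed grader declares ${meta.costProfile} cost`);
  }
  // Heuristic: an LLM grader presumably needs an LLM client provisioned.
  if (meta.determinism === "llm" && !meta.behavior.requiresLlmClient) {
    problems.push(`${meta.name}: llm grader does not request an LLM client`);
  }
  // Heuristic: comparative execution implies a cross-trajectory scope.
  if (meta.behavior.execution === "comparative" && meta.temporalScope !== "cross-trajectory") {
    problems.push(`${meta.name}: comparative grader is not cross-trajectory`);
  }
  return problems;
}
```

Calling `findInconsistencies` with the values declared for my-custom-check above returns an empty list, while a grader that declares llm determinism at free cost would be flagged.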
Next steps
- Add to CI: set up lint and eval in GitHub Actions
- Writing custom graders: build graders with correct taxonomy
- Grader catalog: browse built-in graders and their taxonomy