Grader Taxonomy

Every grader in Vally declares taxonomy metadata: structured properties that describe what the grader does and how much it costs. You use this metadata to decide which graders to list in your eval.yaml for the CI loop and the outer loop; the inner loop runs lint only and needs no graders.

Behavior vs. taxonomy

Grader metadata separates two concerns:

Behavior (metadata.behavior): how the pipeline treats the grader at runtime.
- requiresLlmClient: whether the grader needs an LLM client to be provisioned
- requiresWorkspace: whether the grader needs filesystem access to the trajectory workspace

A grader can also compare trajectories head-to-head by implementing the optional compare() method (see Writing Custom Graders). The built-in prompt grader does this to power compare.

Taxonomy (the remaining fields): descriptive classification shown in reports and used when authoring evals to pick the right graders for a given loop.

This separation ensures that adding a new grader to a plugin package doesn’t require understanding pipeline internals — you just set behavior and the system handles the rest.
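
One way to picture the split is a metadata object with two parts, where only the behavior part drives runtime decisions. The sketch below is illustrative and is not Vally's actual type definition or pipeline code: `SketchBehavior`, `SketchMetadata`, `planProvisioning`, and both grader names are hypothetical, and the taxonomy fields are typed loosely as strings.

```typescript
// Illustrative stand-ins, not the real GraderMetadata type exported by Vally.
interface SketchBehavior {
  requiresLlmClient?: boolean;
  requiresWorkspace?: boolean;
}

interface SketchMetadata {
  name: string;
  behavior: SketchBehavior; // read by the pipeline at runtime
  determinism: string;      // taxonomy: descriptive only
  costProfile: string;      // taxonomy: descriptive only
}

// Hypothetical helper: decide what to provision by looking at behavior alone.
// The taxonomy fields are never consulted here.
function planProvisioning(graders: SketchMetadata[]): { llmClient: boolean; workspace: boolean } {
  return {
    llmClient: graders.some((g) => g.behavior.requiresLlmClient === true),
    workspace: graders.some((g) => g.behavior.requiresWorkspace === true),
  };
}

// Two made-up graders: one needs an LLM client, the other needs nothing.
const plan = planProvisioning([
  { name: "needs-judge", behavior: { requiresLlmClient: true }, determinism: "llm", costProfile: "high" },
  { name: "string-check", behavior: {}, determinism: "static", costProfile: "free" },
]);
console.log(plan); // { llmClient: true, workspace: false }
```

Changing a grader's `determinism` or `costProfile` in this sketch would not change the plan, which is the point of keeping the two concerns apart.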

The taxonomy dimensions

Every grader declares these five properties in its metadata:

Determinism

How reproducible is this grader’s output?

Value	Meaning	Example
`static`	Always produces the same result for the same input. No randomness, no external calls.	`output-contains`, `file-exists`, `output-matches`
`complex-static`	Deterministic but computationally heavier (regex, AST parsing, command execution).	`run-command`
`slm`	Uses a small language model. Mostly reproducible but not guaranteed.	Embedding similarity checker
`llm`	Uses a large language model. Non-deterministic, requires multiple trials for confidence.	LLM judge, comparison judge

Cost profile

How expensive is this grader to run?

Value	Meaning	Typical use
`free`	No compute cost beyond basic CPU.	String matching, file checks
`low`	Minimal cost — local computation, small I/O.	Command execution, glob matching
`medium`	Moderate cost — SLM inference, API calls.	Embedding models
`high`	Significant cost — LLM inference, multi-turn evaluation.	GPT-5.5 judges, A/B comparisons

Reference requirement

Does the grader need a “correct answer” to compare against?

Value	Meaning
`reference-free`	Grades based only on the output/trajectory itself.
`reference-based`	Needs a gold-standard reference answer to compare against.

Temporal scope

What part of the trajectory does the grader inspect?

Value	Meaning
`point-in-time`	Looks at a single moment (e.g., final output).
`trajectory-level`	Inspects the full event sequence of one run.
`cross-trajectory`	Compares across multiple trajectories (A/B testing).

Score kind

How was the score produced?

Value	Meaning
`code`	Computed deterministically by code.
`llm`	Produced by an LLM judge.
`human`	Assigned by a human reviewer.
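
The five value sets above can be collected into a small lookup and used to sanity-check a declared taxonomy. This is a minimal sketch, not part of Vally: `TAXONOMY_VALUES` and `invalidTaxonomyFields` are hypothetical names, and `scoreKind` is an assumed field name, since the tables name the dimension but not its key.

```typescript
// Allowed values per dimension, copied from the tables above.
// The key `scoreKind` is an assumption; the other keys match the metadata
// fields used elsewhere on this page.
const TAXONOMY_VALUES: { [field: string]: string[] } = {
  determinism: ["static", "complex-static", "slm", "llm"],
  costProfile: ["free", "low", "medium", "high"],
  reference: ["reference-free", "reference-based"],
  temporalScope: ["point-in-time", "trajectory-level", "cross-trajectory"],
  scoreKind: ["code", "llm", "human"],
};

// Hypothetical helper: return the names of fields that are missing or hold
// a value outside the documented set.
function invalidTaxonomyFields(raw: { [field: string]: unknown }): string[] {
  const bad: string[] = [];
  for (const field of Object.keys(TAXONOMY_VALUES)) {
    const value = raw[field];
    if (typeof value !== "string" || TAXONOMY_VALUES[field].indexOf(value) === -1) {
      bad.push(field);
    }
  }
  return bad;
}

const declared = {
  determinism: "complex-static",
  costProfile: "low",
  reference: "reference-free",
  temporalScope: "trajectory-level",
  scoreKind: "code",
};
const okResult = invalidTaxonomyFields(declared);
const badResult = invalidTaxonomyFields({ ...declared, determinism: "random" });
console.log(okResult);  // []
console.log(badResult); // [ 'determinism' ]
```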

Using taxonomy to choose graders

Taxonomy is descriptive metadata. When you write an eval.yaml, you explicitly list the graders you want to run; the taxonomy fields help you decide which graders are appropriate for each loop.

Inner loop (lint only)

For the inner loop, you just run vally lint — no eval.yaml needed. Lint runs Vally’s built-in static skill checks (spec compliance, valid refs). Result: instant feedback, zero API calls.

CI loop

In your eval.yaml, include only static and complex-static graders (like output-contains, file-exists, run-command). Set runs: 3 for basic confidence:

defaults:
  runs: 3
stimuli:
  - name: basic-check
    prompt: "..."
    graders:
      - type: file-exists # static, low cost
        config: { path: "*.test.js" }
      - type: run-command # complex-static, low cost
        config: { command: "npm test" }
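
The selection rule for the CI loop can be sketched as a filter on the determinism field. This is an illustration of the decision an eval author makes by hand, not something Vally does automatically: `GraderSummary` and `ciFriendly` are hypothetical names, and the determinism values are the ones given on this page for each grader type.

```typescript
type Determinism = "static" | "complex-static" | "slm" | "llm";

// Hypothetical summary of a grader: its eval.yaml type plus declared determinism.
interface GraderSummary {
  type: string;
  determinism: Determinism;
}

// Keep only graders whose output is reproducible, per the CI guidance above.
function ciFriendly(graders: GraderSummary[]): GraderSummary[] {
  return graders.filter(
    (g) => g.determinism === "static" || g.determinism === "complex-static",
  );
}

const candidates: GraderSummary[] = [
  { type: "file-exists", determinism: "static" },
  { type: "run-command", determinism: "complex-static" },
  { type: "prompt", determinism: "llm" },
];

const ciTypes = ciFriendly(candidates).map((g) => g.type);
console.log(ciTypes); // [ 'file-exists', 'run-command' ]
```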

Outer loop

Add prompt graders (LLM judges) and increase runs for statistical confidence:

defaults:
  runs: 5
  judge_model: gpt-5.5
stimuli:
  - name: quality-check
    prompt: "..."
    graders:
      - type: file-exists
        config: { path: "*.test.js" }
      - type: prompt # llm, high cost
        config:
          prompt: "Are the tests comprehensive?"
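
To get a feel for how cost grows with runs, the sketch below counts judge invocations for a config like the one above. It rests on an assumption not stated on this page: that each LLM-based grader is invoked once per run per stimulus. `StimulusSketch` and `estimateJudgeCalls` are hypothetical names, not Vally APIs.

```typescript
// Hypothetical, simplified view of a stimulus: its name and grader types.
interface StimulusSketch {
  name: string;
  graderTypes: string[];
}

// Assumes one judge call per LLM-based grader, per stimulus, per run.
function estimateJudgeCalls(
  runs: number,
  stimuli: StimulusSketch[],
  llmGraderTypes: string[],
): number {
  let callsPerRun = 0;
  for (const s of stimuli) {
    callsPerRun += s.graderTypes.filter((t) => llmGraderTypes.indexOf(t) !== -1).length;
  }
  return runs * callsPerRun;
}

// Mirrors the outer-loop example: 5 runs, one stimulus, one prompt grader.
const calls = estimateJudgeCalls(
  5,
  [{ name: "quality-check", graderTypes: ["file-exists", "prompt"] }],
  ["prompt"],
);
console.log(calls); // 5
```

Under that assumption, adding a second stimulus with its own prompt grader would double the count to 10, which is why high-cost graders are reserved for the outer loop.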

Declaring taxonomy in your graders

When you write a custom grader, you declare its taxonomy in the metadata property:

import type { Grader, GraderMetadata, GraderInput, GraderResult } from "@microsoft/vally";

export class MyCustomGrader implements Grader {
  metadata: GraderMetadata = {
    name: "my-custom-check",
    description: "Checks something specific to my domain",
    behavior: {},
    determinism: "complex-static",
    costProfile: "low",
    reference: "reference-free",
    temporalScope: "trajectory-level",
  };

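  async grade(input: GraderInput): Promise<GraderResult> {
    // Your grading logic goes here: inspect `input` and return a GraderResult.
    // Throwing keeps this skeleton compilable until the logic is written.
    throw new Error("MyCustomGrader.grade() is not implemented yet");
  }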
}

Declare your grader’s taxonomy honestly. Eval authors rely on these fields to decide whether a grader belongs in fast CI runs or should be reserved for outer-loop evaluation.
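
A quick self-check can catch declarations that contradict the tables on this page. The rules below are heuristics derived from those tables, not checks that Vally is known to enforce, and `DeclaredMetadata` and `taxonomyWarnings` are hypothetical names.

```typescript
// Illustrative subset of grader metadata; not the real GraderMetadata type.
interface DeclaredMetadata {
  name: string;
  behavior: { requiresLlmClient?: boolean };
  determinism: "static" | "complex-static" | "slm" | "llm";
  costProfile: "free" | "low" | "medium" | "high";
}

// Hypothetical heuristic: flag combinations the taxonomy tables rule out.
function taxonomyWarnings(m: DeclaredMetadata): string[] {
  const warnings: string[] = [];
  const deterministic = m.determinism === "static" || m.determinism === "complex-static";
  // Deterministic graders should not depend on a model.
  if (deterministic && m.behavior.requiresLlmClient === true) {
    warnings.push(m.name + ": declared deterministic but requests an LLM client");
  }
  // LLM inference is listed as high cost, so free or low looks understated.
  if (m.determinism === "llm" && (m.costProfile === "free" || m.costProfile === "low")) {
    warnings.push(m.name + ": LLM-based but declared " + m.costProfile + " cost");
  }
  return warnings;
}

const honest = taxonomyWarnings({
  name: "my-custom-check",
  behavior: {},
  determinism: "complex-static",
  costProfile: "low",
});
const suspicious = taxonomyWarnings({
  name: "cheap-judge",
  behavior: { requiresLlmClient: true },
  determinism: "llm",
  costProfile: "low",
});
console.log(honest.length, suspicious.length); // 0 1
```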

Next steps

Add to CI — set up lint and eval in GitHub Actions
Writing custom graders — build graders with correct taxonomy
Grader catalog — browse built-in graders and their taxonomy