Grader Catalog

Vally ships with these built-in graders. They cover a range from instant static checks to LLM-powered judging.

| Grader | Description | Cost | Determinism | Portability |
| --- | --- | --- | --- | --- |
| completed | Check session health: non-empty output, no errors, turn completed | free | static | t1-universal |
| error-count | Limit the number of error events | free | static | t1-universal |
| exit-success | Check agent produced non-empty output (legacy compat) | free | static | t1-universal |
| file-contains | Check if a file contains specific content | low | static | t1-universal |
| file-exists | Check if a file exists in the workspace | low | static | t1-universal |
| file-matches | Check if a file’s content matches a regex | low | static | t1-universal |
| file-not-contains | Check if a file does NOT contain specific content | low | static | t1-universal |
| file-not-exists | Check if a file does NOT exist in the workspace | low | static | t1-universal |
| file-not-matches | Check if no file’s content matches a regex | low | static | t1-universal |
| output-contains | Check if output contains a substring | low | static | t1-universal |
| output-matches | Check if output matches a regex pattern | free | static | t1-universal |
| output-not-contains | Check if output does NOT contain a substring | low | static | t1-universal |
| output-not-matches | Check if output does NOT match a regex pattern | free | static | t1-universal |
| pairwise | LLM judge: compare two trajectories A vs B | high | llm | t3a-scenario |
| panel | LLM panel: N judges in parallel with consensus aggregation | high | llm | t3a-scenario |
| program | Run an arbitrary program as a grader | low | static | t3a-scenario |
| prompt | LLM judge: evaluate output against a rubric | high | llm | t3a-scenario |
| run-command | Run a shell command and check exit code/output | low | complex-static | t2-domain |
| skill-invocation | Check required/disallowed skill activations | free | static | t1-universal |
| token-budget | Enforce a total token budget | free | static | t1-universal |
| tool-call-count | Limit the number of tool calls | free | static | t1-universal |
| tool-calls | Validate required/disallowed/ordered tool calls | low | static | t1-universal |
| turn-count | Limit the number of agent turns | free | static | t1-universal |
| wall-time | Enforce a wall-clock time budget (accepts duration strings) | free | static | t1-universal |

All graders except pairwise are reference-free (no gold-standard answer needed). pairwise is reference-based: it requires two trajectories to compare. Every grader is trajectory-level, meaning it inspects the full run.

The prompt, pairwise, and panel graders are LLM graders with high cost. prompt and panel run in eval and grade; pairwise runs only in compare.
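The cost and command rules above can be sketched as plain data. This is an illustration only, not Vally's API: `CatalogEntry`, `LLM_GRADER_COMMANDS`, `llm_graders_for`, and `without_llm` are hypothetical names, and the rows are a small subset copied from the catalog table.

```python
from typing import NamedTuple


class CatalogEntry(NamedTuple):
    """One catalog row (hypothetical type; columns mirror the table)."""
    name: str
    cost: str          # "free", "low", or "high"
    determinism: str   # "static", "complex-static", or "llm"
    portability: str   # e.g. "t1-universal"


# A few rows copied from the catalog table (not the full list).
CATALOG = [
    CatalogEntry("completed", "free", "static", "t1-universal"),
    CatalogEntry("file-exists", "low", "static", "t1-universal"),
    CatalogEntry("run-command", "low", "complex-static", "t2-domain"),
    CatalogEntry("program", "low", "static", "t3a-scenario"),
    CatalogEntry("prompt", "high", "llm", "t3a-scenario"),
    CatalogEntry("panel", "high", "llm", "t3a-scenario"),
    CatalogEntry("pairwise", "high", "llm", "t3a-scenario"),
]

# Which commands each LLM grader runs in, as stated in the text.
LLM_GRADER_COMMANDS = {
    "prompt": {"eval", "grade"},
    "panel": {"eval", "grade"},
    "pairwise": {"compare"},
}


def llm_graders_for(command: str) -> list[str]:
    """Return the LLM graders that run in the given command, sorted by name."""
    return sorted(
        name for name, commands in LLM_GRADER_COMMANDS.items() if command in commands
    )


def without_llm(entries: list[CatalogEntry]) -> list[str]:
    """Return the names of graders that never call an LLM, in catalog order."""
    return [entry.name for entry in entries if entry.determinism != "llm"]
```

For example, `llm_graders_for("compare")` yields only `pairwise`, and `without_llm(CATALOG)` drops the three high-cost graders, which is a reasonable starting set when a run must stay cheap and deterministic.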

If these don’t cover your needs, see Writing Custom Graders. The Grader interface is simple: implement metadata and grade().
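To give a rough feel for that two-member shape, here is a minimal sketch in Python. Everything beyond the names `metadata` and `grade()` is an assumption: the metadata fields simply mirror the catalog columns, and `GraderMetadata`, `GradeResult`, `MaxLengthGrader`, and the `grade(output)` signature are hypothetical. Vally's actual language, types, and signatures are described in Writing Custom Graders and may differ.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class GraderMetadata:
    """Hypothetical metadata shape; fields mirror the catalog columns."""
    name: str
    description: str
    cost: str
    determinism: str
    portability: str


@dataclass(frozen=True)
class GradeResult:
    """Hypothetical result: a pass/fail flag plus a human-readable reason."""
    passed: bool
    reason: str


class Grader(Protocol):
    """Assumed interface: a metadata attribute and a grade() method."""
    metadata: GraderMetadata

    def grade(self, output: str) -> GradeResult: ...


class MaxLengthGrader:
    """Hypothetical custom grader: passes when output has at most `limit` characters."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.metadata = GraderMetadata(
            name="max-length",
            description="Check that output does not exceed a character limit",
            cost="free",
            determinism="static",
            portability="t1-universal",
        )

    def grade(self, output: str) -> GradeResult:
        length = len(output)
        if length <= self.limit:
            return GradeResult(True, f"output length {length} within limit {self.limit}")
        return GradeResult(False, f"output length {length} exceeds limit {self.limit}")
```

A grader built this way is a static check in the catalog's terms: the same output always produces the same result, and no LLM call is involved.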