Grader Catalog

Vally ships with the built-in graders listed below. They range from instant static checks to LLM-powered judging.

At a glance

Grader	Description	Cost	Determinism
`completed`	Check session health: non-empty output, no errors, turn completed	free	static
`custom-metrics`	Assert on numeric/boolean/string metrics emitted to a JSON file	low	static
`diff-contains`	Check if the workspace diff matches a pattern	low	complex-static
`diff-not-contains`	Check if the workspace diff does NOT match a pattern	low	complex-static
`diff-empty`	Check if the stimulus produced no workspace changes	low	complex-static
`error-count`	Limit the number of error events	free	static
`exit-success`	Check agent produced non-empty output (legacy compat)	free	static
`file-contains`	Check if a file contains specific content	low	static
`file-exists`	Check if a file exists in the workspace	low	static
`file-matches`	Check if a file’s content matches a regex	low	static
`file-not-contains`	Check if a file does NOT contain specific content	low	static
`file-not-exists`	Check if a file does NOT exist in the workspace	low	static
`file-not-matches`	Check if no file’s content matches a regex	low	static
`loop-outcome`	Assert that an agent did (or didn’t) get stuck in a loop	free	static
`max-repeat`	Fail when an action repeats too many times consecutively (loop detection)	free	static
`output-contains`	Check if output contains a substring	free	static
`output-matches`	Check if output matches a regex pattern	free	static
`output-not-contains`	Check if output does NOT contain a substring	free	static
`output-not-matches`	Check if output does NOT match a regex pattern	free	static
`panel`	LLM panel — N judges in parallel with consensus aggregation	high	llm
`program`	Run an arbitrary program as a grader	low	static
`prompt`	LLM judge — evaluate output against a rubric (also powers `compare`)	high	llm
`run-command`	Run a shell command and check exit code/output	low	complex-static
`skill-invocation`	Check required/disallowed skill activations	free	static
`token-budget`	Enforce a total token budget	free	static
`tool-call-count`	Limit the number of tool calls	free	static
`tool-calls`	Validate required/disallowed/ordered tool calls	low	static
`transcript-contains`	Check if any assistant message contains a substring	free	static
`transcript-matches`	Check if any assistant message matches a regex pattern	free	static
`transcript-not-contains`	Check if NO assistant message contains a substring	free	static
`transcript-not-matches`	Check if NO assistant message matches a regex pattern	free	static
`turn-count`	Limit the number of agent turns	free	static
`wall-time`	Enforce a wall-clock time budget (accepts duration strings)	free	static

All graders are reference-free (no gold-standard answer needed) and trajectory-level (inspect the full run).
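
To make "reference-free" and "trajectory-level" concrete, here is a minimal sketch of the kind of check `max-repeat` describes: it looks at the whole sequence of actions in a run and needs no gold-standard answer. The function names, the list-of-strings trajectory, and the `limit` parameter are all hypothetical, chosen for illustration; they are not Vally's actual implementation or configuration.

```python
from itertools import groupby


def longest_consecutive_repeat(actions):
    """Return the length of the longest run of identical consecutive actions."""
    # groupby yields one group per run of equal adjacent items.
    return max((sum(1 for _ in run) for _, run in groupby(actions)), default=0)


def max_repeat_passes(actions, limit):
    """Pass when no action repeats more than `limit` times in a row (hypothetical rule)."""
    return longest_consecutive_repeat(actions) <= limit


# Hypothetical trajectory: the agent reruns the same tool four times in a row.
trajectory = ["read_file", "edit_file", "run_tests", "run_tests", "run_tests", "run_tests"]
print(max_repeat_passes(trajectory, limit=3))
```

The check uses only the run itself as input, which is why graders of this kind can be free and static.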

The `prompt` and `panel` graders are LLM graders with high cost, and both run in `eval` and `grade`. The `prompt` grader additionally supports a comparison mode (baseline vs. treatment) that runs via `compare`.
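
The catalog describes `panel` as N judges with consensus aggregation but does not say how consensus is computed. As one possible reading, the sketch below uses a strict-majority vote over per-judge verdicts. The function name, the `"pass"`/`"fail"` verdict strings, and the majority rule itself are assumptions for illustration, not Vally's documented behavior.

```python
from collections import Counter


def majority_verdict(verdicts):
    """Return the verdict a strict majority of judges agree on, or None if there is none."""
    if not verdicts:
        return None
    verdict, votes = Counter(verdicts).most_common(1)[0]
    # Require more than half of the judges to agree (assumed consensus rule).
    return verdict if votes * 2 > len(verdicts) else None


# Hypothetical verdicts from a three-judge panel.
print(majority_verdict(["pass", "pass", "fail"]))
```

A split panel yields `None` in this sketch, which a caller could treat as "no consensus" rather than forcing a pass or a fail.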

Writing your own

If these don’t cover your needs, see Writing Custom Graders. The `Grader` interface is small: implement `metadata` and `grade()`.
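
As a rough illustration of the shape of such a grader, the sketch below pairs a `metadata` attribute with a `grade()` method. Only those two member names come from this page. The field names, the `GradeResult` type, the `grade(output)` signature, and the choice of Python are assumptions; the real interface is defined in Writing Custom Graders and may differ.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class GraderMetadata:
    """Hypothetical metadata mirroring the catalog columns."""
    name: str
    cost: str         # e.g. "free", "low", "high"
    determinism: str  # e.g. "static", "complex-static", "llm"


@dataclass(frozen=True)
class GradeResult:
    """Hypothetical result type: a verdict plus a short explanation."""
    passed: bool
    reason: str


class Grader(Protocol):
    """Assumed shape of the interface: metadata plus grade()."""
    metadata: GraderMetadata

    def grade(self, output: str) -> GradeResult: ...


class OutputMentionsTests:
    """Example custom grader: passes when the output mentions tests."""
    metadata = GraderMetadata(name="output-mentions-tests", cost="free", determinism="static")

    def grade(self, output: str) -> GradeResult:
        found = "test" in output.lower()
        return GradeResult(found, "mentions tests" if found else "no mention of tests")


print(OutputMentionsTests().grade("All tests passed"))
```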