Grader Catalog
Vally ships with the built-in graders listed below. They range from instant static checks to LLM-powered judging.
At a glance
| Grader | Description | Cost | Determinism | Portability |
|---|---|---|---|---|
| completed | Check session health: non-empty output, no errors, turn completed | free | static | t1-universal |
| error-count | Limit the number of error events | free | static | t1-universal |
| exit-success | Check agent produced non-empty output (legacy compat) | free | static | t1-universal |
| file-contains | Check if a file contains specific content | low | static | t1-universal |
| file-exists | Check if a file exists in the workspace | low | static | t1-universal |
| file-matches | Check if a file’s content matches a regex | low | static | t1-universal |
| file-not-contains | Check if a file does NOT contain specific content | low | static | t1-universal |
| file-not-exists | Check if a file does NOT exist in the workspace | low | static | t1-universal |
| file-not-matches | Check if no file’s content matches a regex | low | static | t1-universal |
| output-contains | Check if output contains a substring | low | static | t1-universal |
| output-matches | Check if output matches a regex pattern | free | static | t1-universal |
| output-not-contains | Check if output does NOT contain a substring | low | static | t1-universal |
| output-not-matches | Check if output does NOT match a regex pattern | free | static | t1-universal |
| pairwise | LLM judge: compare two trajectories A vs B | high | llm | t3a-scenario |
| panel | LLM panel: N judges in parallel with consensus aggregation | high | llm | t3a-scenario |
| program | Run an arbitrary program as a grader | low | static | t3a-scenario |
| prompt | LLM judge: evaluate output against a rubric | high | llm | t3a-scenario |
| run-command | Run a shell command and check exit code/output | low | complex-static | t2-domain |
| skill-invocation | Check required/disallowed skill activations | free | static | t1-universal |
| token-budget | Enforce a total token budget | free | static | t1-universal |
| tool-call-count | Limit the number of tool calls | free | static | t1-universal |
| tool-calls | Validate required/disallowed/ordered tool calls | low | static | t1-universal |
| turn-count | Limit the number of agent turns | free | static | t1-universal |
| wall-time | Enforce a wall-clock time budget (accepts duration strings) | free | static | t1-universal |
All graders are reference-free (no gold-standard answer needed) and trajectory-level (they inspect the full run). The one exception is pairwise, which is reference-based: it requires two trajectories to compare.
The prompt, pairwise, and panel graders are LLM graders with high cost. Of these, prompt and panel run in eval and grade, while pairwise runs only in compare.
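As a rough illustration of how the Cost and Determinism columns might guide grader selection, the sketch below copies a few rows from the table above into a plain Python list and orders them cheapest-first. This is not Vally's API: `GraderInfo`, `CATALOG`, `cheapest_first`, and `llm_graders` are hypothetical names invented for this example, and only the row values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraderInfo:
    """Hypothetical record mirroring the catalog table's columns."""
    name: str
    cost: str          # "free", "low", or "high"
    determinism: str   # "static", "complex-static", or "llm"
    portability: str


# A subset of rows copied from the catalog table (not the full list).
CATALOG = [
    GraderInfo("completed", "free", "static", "t1-universal"),
    GraderInfo("file-exists", "low", "static", "t1-universal"),
    GraderInfo("output-matches", "free", "static", "t1-universal"),
    GraderInfo("run-command", "low", "complex-static", "t2-domain"),
    GraderInfo("prompt", "high", "llm", "t3a-scenario"),
    GraderInfo("pairwise", "high", "llm", "t3a-scenario"),
    GraderInfo("panel", "high", "llm", "t3a-scenario"),
]

# Rank cost labels so they can be used as a sort key.
COST_ORDER = {"free": 0, "low": 1, "high": 2}


def cheapest_first(graders):
    """Sort graders by cost rank, then alphabetically by name."""
    return sorted(graders, key=lambda g: (COST_ORDER[g.cost], g.name))


def llm_graders(graders):
    """Return the names of graders whose determinism is 'llm'."""
    return [g.name for g in graders if g.determinism == "llm"]


if __name__ == "__main__":
    for g in cheapest_first(CATALOG):
        print(f"{g.name:16} {g.cost:5} {g.determinism}")
```

Ordering this way puts the free static checks ahead of the high-cost LLM judges, which is one plausible way to arrange a suite so cheap failures surface first.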
Writing your own
If these don’t cover your needs, see Writing Custom Graders. The Grader interface is simple: implement metadata and grade().
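To give a feel for the "metadata plus grade()" shape, here is a minimal sketch written in Python purely for illustration. The real interface is defined in Writing Custom Graders and may differ in language, field names, and signatures. Everything here is an assumption: the `GraderMetadata` and `GradeResult` types, their fields (loosely modeled on the catalog columns), the `output-word-limit` grader, and the simplification that `grade()` receives only the output text rather than a full trajectory.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class GraderMetadata:
    """Hypothetical metadata; fields loosely mirror the catalog columns."""
    name: str
    description: str
    cost: str
    determinism: str
    portability: str


@dataclass(frozen=True)
class GradeResult:
    """Hypothetical result type: a pass/fail flag plus a short reason."""
    passed: bool
    reason: str


class Grader(Protocol):
    """Assumed shape of a grader: a metadata attribute and a grade() method."""
    metadata: GraderMetadata

    def grade(self, output: str) -> GradeResult: ...


class OutputWordLimit:
    """Hypothetical custom grader: pass if the output has at most max_words words."""

    def __init__(self, max_words: int) -> None:
        self.max_words = max_words
        self.metadata = GraderMetadata(
            name="output-word-limit",
            description="Limit the number of words in the output",
            cost="free",
            determinism="static",
            portability="t1-universal",
        )

    def grade(self, output: str) -> GradeResult:
        # Count whitespace-separated words in the output text.
        count = len(output.split())
        if count <= self.max_words:
            return GradeResult(True, f"{count} words within limit of {self.max_words}")
        return GradeResult(False, f"{count} words exceeds limit of {self.max_words}")


if __name__ == "__main__":
    grader: Grader = OutputWordLimit(max_words=3)
    print(grader.metadata.name, grader.grade("one two three"))
```

The check is pure and deterministic, so in the catalog's terms it would sit alongside the free, static graders.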