Grader: prompt (LLM judge)

The prompt grader sends the agent’s trajectory to an LLM (the “judge”) and asks it to evaluate quality against a rubric. It is the most powerful built-in grader because it can assess qualities that no static check can, such as whether an explanation is clear or whether code follows best practices.

Taxonomy

| Property | Value |
| --- | --- |
| Determinism | `llm` |
| Cost | `high` |
| Reference | `reference-free` |
| Temporal scope | `trajectory-level` |
| Score kind | `llm` |

Config

graders:
  - type: prompt
    config:
      prompt: "Did the agent produce working unit tests?"
      model: gpt-5.5 # optional: override judge model
      scoring: scale_1_5 # optional: scoring scale
      threshold: 0.5 # optional: pass threshold

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `prompt` | string | No | Uses default rubric | Custom evaluation prompt for the judge |
| `model` | string | No | `claude-sonnet-4.6` (or `--judge-model`) | Which model to use as the judge |
| `scoring` | `"binary"` \| `"scale_1_5"` \| `"scale_1_10"` | No | `"scale_1_5"` | Scoring scale for the judge’s response |
| `threshold` | number | No | `0.5` | Score threshold for passing (normalized to 0–1) |

Scoring scales

| Scale | Range | Default threshold | When to use |
| --- | --- | --- | --- |
| `binary` | 0 or 1 | 0.5 | Simple yes/no judgments |
| `scale_1_5` | 1–5 | 0.5 (≈ 3/5) | General-purpose evaluation |
| `scale_1_10` | 1–10 | 0.5 (≈ 5.5/10) | Fine-grained quality assessment |

Scores are normalized to [0, 1] regardless of scale.
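
As a rough illustration of the normalization, the default thresholds in the table (0.5 ≈ 3/5 and 0.5 ≈ 5.5/10) are consistent with a min-max mapping of the raw score onto [0, 1]. The sketch below assumes that mapping and a "greater than or equal" pass rule. Both are inferences from the table, not confirmed behavior, and the names `SCALES`, `normalize`, and `passes` are hypothetical. A plain raw/max division would give different values (3/5 would become 0.6), so check real grader output before relying on either.

```python
# Hypothetical sketch: min-max normalization inferred from the threshold table.
SCALES = {
    "binary": (0, 1),
    "scale_1_5": (1, 5),
    "scale_1_10": (1, 10),
}


def normalize(raw: float, scoring: str = "scale_1_5") -> float:
    """Map a raw judge score onto [0, 1] using the scale's own bounds."""
    low, high = SCALES[scoring]
    if not low <= raw <= high:
        raise ValueError(f"score {raw} is outside the {scoring} range {low}-{high}")
    return (raw - low) / (high - low)


def passes(raw: float, scoring: str = "scale_1_5", threshold: float = 0.5) -> bool:
    """Assumed pass rule: the normalized score meets or exceeds the threshold."""
    return normalize(raw, scoring) >= threshold
```

Under these assumptions, a raw 3 on `scale_1_5` sits exactly on the default threshold, and a raw 2 falls below it.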

How it works

1. The trajectory is formatted into a readable timeline (tool calls, messages, outputs) with head/tail windowing for long trajectories.
2. A system prompt instructs the judge to evaluate against the rubric criteria.
3. The judge must call the submit_grade tool exactly once with per-criterion scores and reasoning.
4. If the judge fails to call the tool or sends invalid arguments, an in-session reminder nudges it to retry (up to 2 times).
5. Per-criterion results are mapped to GraderResult.details sub-checks.
6. The overall score is normalized to [0, 1].
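
The tool-call and reminder steps can be pictured as a small loop. This is a minimal sketch, not vally's implementation: `ask_judge` is a hypothetical callable standing in for one judge turn, and the argument shape checked by `is_valid_grade` (a `criteria` list with `name`, `score`, and `reasoning`) is an assumed schema for illustration only.

```python
from typing import Callable, Optional

MAX_REMINDERS = 2  # "up to 2 times", per the steps above

REMINDER = (
    "Reminder: call submit_grade exactly once with per-criterion "
    "scores and reasoning."
)


def is_valid_grade(args: object) -> bool:
    """Check the assumed (hypothetical) submit_grade argument shape."""
    if not isinstance(args, dict):
        return False
    criteria = args.get("criteria")
    if not isinstance(criteria, list) or not criteria:
        return False
    return all(
        isinstance(c, dict)
        and isinstance(c.get("name"), str)
        and isinstance(c.get("score"), (int, float))
        and isinstance(c.get("reasoning"), str)
        for c in criteria
    )


def collect_grade(
    ask_judge: Callable[[str], Optional[dict]], rubric_prompt: str
) -> Optional[dict]:
    """Ask once, then send up to MAX_REMINDERS reminders in the same session.

    ask_judge returns the tool-call arguments, or None if the judge
    replied without calling the tool.
    """
    message = rubric_prompt
    for _ in range(1 + MAX_REMINDERS):
        args = ask_judge(message)
        if is_valid_grade(args):
            return args
        message = REMINDER
    return None  # the judge never produced a usable grade
```

A caller would treat a `None` result as a grading failure rather than as a score.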

Judge model resolution

The model used for judging follows this priority chain:

1. Grader-level config.model in eval.yaml (most specific)
2. --judge-model CLI flag
3. defaults.judge_model in eval.yaml (global default for all LLM graders)
4. EVAL_JUDGE_MODEL environment variable
5. Default: claude-sonnet-4.6

defaults:
  model: gpt-5.5 # agent execution model
  judge_model: gpt-5.5 # default judge model for all LLM graders

stimuli:
  - name: test-case
    graders:
      - type: prompt
        config:
          model: o3 # this specific grader uses o3
      - type: prompt # this one uses gpt-5.5 (from judge_model)
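
The priority chain amounts to "first non-empty value wins". The function below is a hedged sketch of that rule, not vally's code: `resolve_judge_model` and its parameters are hypothetical names, and the config dictionaries stand in for the parsed eval.yaml.

```python
import os
from typing import Mapping, Optional

DEFAULT_JUDGE_MODEL = "claude-sonnet-4.6"


def resolve_judge_model(
    grader_config: Mapping[str, str],
    cli_judge_model: Optional[str],
    defaults: Mapping[str, str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the first configured model, most specific source first."""
    candidates = (
        grader_config.get("model"),    # 1. grader-level config.model
        cli_judge_model,               # 2. --judge-model CLI flag
        defaults.get("judge_model"),   # 3. defaults.judge_model
        env.get("EVAL_JUDGE_MODEL"),   # 4. environment variable
    )
    for candidate in candidates:
        if candidate:
            return candidate
    return DEFAULT_JUDGE_MODEL         # 5. built-in default
```

Applied to the YAML above, the first grader resolves to `o3` and the second to `gpt-5.5`, provided no `--judge-model` flag is passed.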

Judge reasoning effort

Set defaults.judge_reasoning_effort to control the reasoning effort of the judge model (low, medium, high, or xhigh). It applies only to graders that use the eval-level judge_model; a grader that pins its own config.model keeps that model’s default effort. When unset, the judge runs at the model’s default effort, which is neither controlled nor recorded. When set, the effective value is recorded in the grader result metadata as reasoning_effort.

defaults:
  judge_model: claude-opus-4.6
  judge_reasoning_effort: high # judge deliberates harder, reproducibly

A per-grader config.reasoning_effort overrides the eval-level default and is the natural companion to a per-grader model:

graders:
  - type: prompt
    config:
      model: o3
      reasoning_effort: high # effort for this grader's own judge model

Reasoning effort only takes effect on judge models that support it. vally does not validate the value: it is passed through to the model provider, which may ignore or reject it for a model that lacks support. Consult your model provider’s documentation for which models support reasoning effort and which levels they accept.
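
The interaction between the eval-level and per-grader settings can be summarized in a few lines. This sketch only encodes the rules stated in this section. `effective_reasoning_effort` is a hypothetical helper, and it does not cover how a `--judge-model` override interacts with the eval-level effort, which this page does not specify.

```python
from typing import Mapping, Optional


def effective_reasoning_effort(
    grader_config: Mapping[str, str], defaults: Mapping[str, str]
) -> Optional[str]:
    """Return the effort sent to the provider, or None for the model default."""
    per_grader = grader_config.get("reasoning_effort")
    if per_grader is not None:
        # A per-grader value overrides the eval-level default.
        return per_grader
    if grader_config.get("model") is not None:
        # The grader pins its own model, so the eval-level effort does not apply.
        return None
    # The grader uses the eval-level judge_model and inherits the eval-level effort.
    return defaults.get("judge_reasoning_effort")
```

No validation is performed here, which mirrors the pass-through behavior described above.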

Rubric and evaluation criteria

If you don’t provide a custom prompt, the default rubric evaluates:

Task completion — Did the agent accomplish what was asked?
Correctness — Is the output factually/technically correct?
Quality — Is the output well-structured and clear?

Custom prompts let you focus on domain-specific criteria:

graders:
  - type: prompt
    config:
      prompt: |
        Evaluate whether the agent's unit tests are comprehensive:
        1. Do the tests cover edge cases (empty input, nulls, errors)?
        2. Are assertions specific (not just "toBeTruthy")?
        3. Do the tests actually run (valid syntax, proper imports)?

Evidence examples

✔ prompt  Score: 4.2/5 (0.84) — task_completion: 5/5, correctness: 4/5, quality: 4/5
✘ prompt  Score: 2.1/5 (0.42) — task_completion: 3/5, correctness: 1/5, quality: 2/5

Each criterion score is available as a sub-result in details:

✘ prompt (score: 0.42)
  ✔ task_completion (3/5): Agent attempted all requested tasks
  ✘ correctness (1/5): Generated code has syntax errors and wrong imports
  ✘ quality (2/5): Output is disorganized with no explanation

Retry behavior

LLM calls can fail due to rate limits or transient errors. The prompt grader retries with exponential backoff:

- Up to 2 retries (3 total attempts)
- Exponential backoff with jitter (5s → 10s → 20s + random 0–1s)
- 10-minute total budget; the grader does not retry past this limit
- If all retries fail, the grader returns a failed result (score 0) with the error in evidence, rather than crashing the eval
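
The retry policy can be sketched as follows. This is an illustration under assumptions, not the grader's code: the function names, the injected `sleep`, `clock`, and `jitter` parameters, and the shape of the failed result are hypothetical. With 2 retries, only the first two delays of the 5s → 10s → 20s schedule are ever used.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

MAX_RETRIES = 2          # 3 total attempts
BASE_DELAY_S = 5.0       # doubles each retry: 5s, 10s, ...
TOTAL_BUDGET_S = 600.0   # 10-minute overall limit


def call_with_retry(
    call: Callable[[], T],
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
    jitter: Callable[[], float] = lambda: random.uniform(0.0, 1.0),
) -> T:
    """Retry a failing call with exponential backoff, jitter, and a time budget."""
    start = clock()
    for attempt in range(MAX_RETRIES + 1):
        try:
            return call()
        except Exception:
            if attempt == MAX_RETRIES:
                raise  # out of retries
            delay = BASE_DELAY_S * (2 ** attempt) + jitter()
            if (clock() - start) + delay > TOTAL_BUDGET_S:
                raise  # the next attempt would start past the budget
            sleep(delay)
    raise RuntimeError("unreachable")


def grade_or_fail(call: Callable[[], dict], **retry_kwargs) -> dict:
    """Turn a final failure into a score-0 result instead of raising."""
    try:
        return call_with_retry(call, **retry_kwargs)
    except Exception as exc:
        # Hypothetical result shape; the real grader result has its own fields.
        return {"score": 0.0, "passed": False, "evidence": str(exc)}
```

Injecting `sleep` and `jitter` keeps the sketch testable without real waiting.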

Cost considerations

Every trial graded by a prompt grader makes at least one LLM API call. With multi-trial runs, the cost scales as:

Cost ≈ (num_stimuli × runs × prompt_graders_per_stimulus) × per-call cost

For example, 5 stimuli × 5 runs × 1 prompt grader = 25 LLM judge calls per eval.
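
The same arithmetic as a small helper, for budgeting before a run. The function names and the per-call price are illustrative assumptions. Because each graded trial makes at least one call, and reminders or retries add more, the result is a lower bound.

```python
def judge_calls(num_stimuli: int, runs: int, prompt_graders_per_stimulus: int) -> int:
    """Minimum number of LLM judge calls for one eval."""
    return num_stimuli * runs * prompt_graders_per_stimulus


def estimated_cost(
    num_stimuli: int,
    runs: int,
    prompt_graders_per_stimulus: int,
    per_call_cost: float,
) -> float:
    """Lower-bound cost estimate; per_call_cost is whatever your provider charges."""
    return judge_calls(num_stimuli, runs, prompt_graders_per_stimulus) * per_call_cost


# The example from the text: 5 stimuli x 5 runs x 1 prompt grader.
calls = judge_calls(5, 5, 1)
```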

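Use --judge-model to control costs: a smaller model for iteration, a larger model for final evaluation. Graders that set their own config.model are unaffected, because grader-level config takes priority over the flag.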

Comparison mode

The prompt grader also powers head-to-head comparison: instead of scoring one trajectory, it judges a baseline against a treatment for the same stimulus and rubric, and reports which is better and by how much. Run it with vally compare — any stimulus with a rubric can be compared.

How it works

1. A baseline trajectory and a treatment trajectory for the same stimulus are loaded (from an experiment’s variants, or two independent runs).
2. The judge compares them against the rubric and submits a verdict via a submit_comparison_grade tool call.
3. To remove order bias, the comparison runs twice with the two responses swapped (position-swap debiasing). If the directions disagree on the winner, the result is a tie; if they agree on the winner but not the magnitude, the weaker magnitude wins.

Scoring scale

The verdict is signed and treatment-relative, in [-1, 1]:

| Verdict | Score | Meaning |
| --- | --- | --- |
| much better | `+1.0` | treatment is clearly better |
| slightly better | `+0.4` | treatment is somewhat better |
| equal / tie | `0` | no meaningful difference |
| slightly worse | `-0.4` | baseline is somewhat better |
| much worse | `-1.0` | baseline is clearly better |
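
The position-swap rule can be written as a function of the two verdict scores. This is a sketch under assumptions: `debias` is a hypothetical name, both inputs are taken to be already expressed relative to the treatment (so the swapped run's verdict has been flipped back), and a tie in either direction is treated as disagreement on the winner.

```python
VERDICT_SCORES = {
    "much better": 1.0,
    "slightly better": 0.4,
    "equal": 0.0,
    "slightly worse": -0.4,
    "much worse": -1.0,
}


def debias(forward: float, swapped: float) -> float:
    """Combine two treatment-relative verdict scores from position-swapped runs."""
    # A tie in either run, or opposite signs, means the runs disagree on a winner.
    if forward == 0 or swapped == 0 or (forward > 0) != (swapped > 0):
        return 0.0
    # Same winner: keep the weaker (smaller) magnitude, with the shared sign.
    magnitude = min(abs(forward), abs(swapped))
    return magnitude if forward > 0 else -magnitude
```

For example, "much better" in one order and "slightly better" in the other combine to +0.4, while "much better" and "slightly worse" combine to a tie.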

See the compare CLI reference for usage, statistics, and examples.