
Grader: prompt (LLM judge)

The prompt grader sends the agent’s trajectory to an LLM (the “judge”) and asks it to evaluate quality against a rubric. This is the most powerful built-in grader — it can assess things no static check can, like whether an explanation is clear or whether code follows best practices.

| Property | Value |
| --- | --- |
| Determinism | llm |
| Cost | high |
| Portability | t3a-scenario |
| Reference | reference-free |
| Temporal scope | trajectory-level |
| Score kind | llm |

graders:
  - type: prompt
    config:
      prompt: "Did the agent produce working unit tests?"
      model: gpt-5.5 # optional: override judge model
      scoring: scale_1_5 # optional: scoring scale
      threshold: 0.5 # optional: pass threshold

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| prompt | string | No | Uses default rubric | Custom evaluation prompt for the judge |
| model | string | No | claude-sonnet-4.6 (or --judge-model) | Which model to use as the judge |
| scoring | "binary" \| "scale_1_5" \| "scale_1_10" | No | "scale_1_5" | Scoring scale for the judge’s response |
| threshold | number | No | 0.5 | Score threshold for passing (normalized to 0–1) |

| Scale | Range | Default threshold | When to use |
| --- | --- | --- | --- |
| binary | 0 or 1 | 0.5 | Simple yes/no judgments |
| scale_1_5 | 1–5 | 0.5 (≈ 3/5) | General-purpose evaluation |
| scale_1_10 | 1–10 | 0.5 (≈ 5.5/10) | Fine-grained quality assessment |

Scores are normalized to [0, 1] regardless of scale.

  1. The trajectory is formatted into a readable timeline (tool calls, messages, outputs) with head/tail windowing for long trajectories
  2. A system prompt instructs the judge to evaluate against the rubric criteria
  3. The judge must call the submit_grade tool exactly once with per-criterion scores and reasoning
  4. If the judge fails to call the tool or sends invalid arguments, an in-session reminder nudges it to retry (up to 2 times)
  5. Per-criterion results are mapped to GraderResult.details sub-checks
  6. The overall score is normalized to [0, 1]
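
Steps 3 and 4 above can be sketched as a small loop. This is an illustrative sketch only, not the grader's actual code: `ask_judge`, the reminder text, and the `criteria` payload shape are all hypothetical, and it assumes "up to 2 times" means one initial attempt plus at most two reminders.

```python
from typing import Callable, Optional

MAX_REMINDERS = 2  # assumption: one initial attempt plus up to 2 reminders

# Hypothetical reminder wording; the real in-session reminder is not documented here.
REMINDER = "Reminder: call submit_grade exactly once with valid arguments."


def is_valid_grade(args: Optional[dict]) -> bool:
    """Check the hypothetical submit_grade payload shape used in this sketch."""
    if not isinstance(args, dict):
        return False
    criteria = args.get("criteria")
    if not isinstance(criteria, list) or not criteria:
        return False
    return all(
        isinstance(c, dict)
        and isinstance(c.get("name"), str)
        and isinstance(c.get("score"), (int, float))
        and isinstance(c.get("reasoning"), str)
        for c in criteria
    )


def collect_grade(
    ask_judge: Callable[[str], Optional[dict]], rubric: str
) -> Optional[dict]:
    """Ask the judge, re-prompting while the tool call is missing or invalid."""
    message = rubric
    for _ in range(1 + MAX_REMINDERS):
        args = ask_judge(message)  # None models "the judge did not call the tool"
        if is_valid_grade(args):
            return args
        message = REMINDER
    return None  # caller decides how to report a judge that never complied
```

Passing the judge in as a callable keeps the loop testable with a stub that fails once or twice before returning a valid payload.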

The model used for judging follows this priority chain:

  1. Grader-level config.model in eval.yaml (most specific)
  2. --judge-model CLI flag
  3. config.judge_model in eval.yaml (global default for all LLM graders)
  4. Default: claude-sonnet-4.6
eval.yaml

config:
  model: gpt-5.5 # agent execution model
  judge_model: gpt-5.5 # default judge model for all LLM graders
stimuli:
  - name: test-case
    graders:
      - type: prompt
        config:
          model: o3 # this specific grader uses o3
      - type: prompt # this one uses gpt-5.5 (from judge_model)
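
The priority chain can be read as a "first non-empty value wins" lookup. The following is a minimal sketch under that assumption; the function and parameter names are hypothetical and do not come from the tool itself.

```python
from typing import Optional

DEFAULT_JUDGE_MODEL = "claude-sonnet-4.6"  # documented default


def resolve_judge_model(
    grader_model: Optional[str] = None,        # grader-level config.model
    cli_judge_model: Optional[str] = None,     # --judge-model CLI flag
    global_judge_model: Optional[str] = None,  # config.judge_model in eval.yaml
) -> str:
    """Return the first configured model, in the documented priority order."""
    for candidate in (grader_model, cli_judge_model, global_judge_model):
        if candidate:
            return candidate
    return DEFAULT_JUDGE_MODEL


# Mirrors the eval.yaml example: the first grader sets its own model,
# the second falls back to the global judge_model.
print(resolve_judge_model(grader_model="o3", global_judge_model="gpt-5.5"))  # o3
print(resolve_judge_model(global_judge_model="gpt-5.5"))                     # gpt-5.5
```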

If you don’t provide a custom prompt, the default rubric evaluates:

  • Task completion — Did the agent accomplish what was asked?
  • Correctness — Is the output factually/technically correct?
  • Quality — Is the output well-structured and clear?

Custom prompts let you focus on domain-specific criteria:

graders:
  - type: prompt
    config:
      prompt: |
        Evaluate whether the agent's unit tests are comprehensive:
        1. Do the tests cover edge cases (empty input, nulls, errors)?
        2. Are assertions specific (not just "toBeTruthy")?
        3. Do the tests actually run (valid syntax, proper imports)?

✔ prompt Score: 4.2/5 (0.84) — task_completion: 5/5, correctness: 4/5, quality: 4/5
✘ prompt Score: 2.1/5 (0.42) — task_completion: 3/5, correctness: 1/5, quality: 2/5

Each criterion score is available as a sub-result in details:

✘ prompt (score: 0.42)
  ✔ task_completion (5/5): Agent attempted all requested tasks
  ✘ correctness (1/5): Generated code has syntax errors and wrong imports
  ✘ quality (2/5): Output is disorganized with no explanation

LLM calls can fail due to rate limits or transient errors. The prompt grader retries with exponential backoff:

  • Up to 2 retries (3 total attempts)
  • Exponential backoff with jitter (5s → 10s → 20s + random 0–1s)
  • 10-minute total budget: no retry is attempted past this limit
  • If all retries fail, the grader returns a failed result (score 0) with the error in evidence, rather than crashing the eval
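
The retry policy described above might look roughly like the sketch below. It is an assumption-laden illustration, not the grader's implementation: `call_judge` is a hypothetical callable that returns a normalized score or raises, the result dictionary keys are invented for the example, and the budget check is one plausible reading of "won't retry past this limit".

```python
import random
import time
from typing import Callable

BASE_DELAY_S = 5.0      # first backoff step from the text
MAX_RETRIES = 2         # 3 total attempts
TOTAL_BUDGET_S = 600.0  # 10-minute total budget


def backoff_delay(retry_index: int) -> float:
    """Delay before retry `retry_index` (0-based): 5s, 10s, 20s, plus 0-1s jitter."""
    return BASE_DELAY_S * (2 ** retry_index) + random.random()


def grade_with_retries(
    call_judge: Callable[[], float],
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call the judge, retrying on errors; return score 0 instead of raising."""
    start = clock()
    last_error = ""
    for attempt in range(MAX_RETRIES + 1):
        try:
            return {"score": call_judge(), "evidence": ""}
        except Exception as exc:  # rate limits and transient errors
            last_error = str(exc)
        if attempt == MAX_RETRIES:
            break  # no attempts left
        delay = backoff_delay(attempt)
        if clock() - start + delay > TOTAL_BUDGET_S:
            break  # waiting would exceed the total budget
        sleep(delay)
    # Failed result: score 0 with the error recorded, so the eval keeps running.
    return {"score": 0.0, "evidence": last_error}
```

Injecting `sleep` and `clock` lets the policy be exercised in tests without real waiting.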

Every trial graded by a prompt grader makes at least one LLM API call. With multi-trial runs, the cost scales as:

Cost ≈ (num_stimuli × runs × prompt_graders_per_stimulus) × per-call cost

For example, 5 stimuli × 5 runs × 1 prompt grader = 25 LLM judge calls per eval.
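
As a quick way to budget before a run, the formula can be written as a small helper. The function names and the per-call price below are placeholders for illustration, not real pricing. Because reminders and retries add calls, the count is a lower bound.

```python
def judge_calls(num_stimuli: int, runs: int, prompt_graders_per_stimulus: int) -> int:
    """Minimum number of LLM judge calls for one eval."""
    return num_stimuli * runs * prompt_graders_per_stimulus


def estimated_cost(calls: int, per_call_cost: float) -> float:
    """Rough cost estimate; per_call_cost depends on the judge model and trajectory size."""
    return calls * per_call_cost


calls = judge_calls(num_stimuli=5, runs=5, prompt_graders_per_stimulus=1)
print(calls)  # 25, matching the example above
print(estimated_cost(calls, per_call_cost=0.02))  # 0.02 is a made-up placeholder price
```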

Use --judge-model to control costs: a smaller model for iteration, a larger model for final evaluation.