Grader: prompt (LLM judge)
The prompt grader sends the agent’s trajectory to an LLM (the “judge”) and asks it to evaluate quality against a rubric. This is the most powerful built-in grader — it can assess things no static check can, like whether an explanation is clear or whether code follows best practices.
Taxonomy
| Property | Value |
|---|---|
| Determinism | llm |
| Cost | high |
| Portability | t3a-scenario |
| Reference | reference-free |
| Temporal scope | trajectory-level |
| Score kind | llm |
Config
```yaml
graders:
  - type: prompt
    config:
      prompt: "Did the agent produce working unit tests?"
      model: gpt-5.5        # optional: override judge model
      scoring: scale_1_5    # optional: scoring scale
      threshold: 0.5        # optional: pass threshold
```

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `prompt` | string | No | Uses default rubric | Custom evaluation prompt for the judge |
| `model` | string | No | `claude-sonnet-4.6` (or `--judge-model`) | Which model to use as the judge |
| `scoring` | `"binary"` \| `"scale_1_5"` \| `"scale_1_10"` | No | `"scale_1_5"` | Scoring scale for the judge’s response |
| `threshold` | number | No | 0.5 | Score threshold for passing (normalized to 0–1) |
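As a rough illustration of how the defaults and allowed values in the table fit together, here is a minimal Python sketch. The `PromptGraderConfig` class and its `validate` method are hypothetical names introduced for this example; they are not part of the tool's API.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed scoring scales, taken from the config table.
SCORING_SCALES = ("binary", "scale_1_5", "scale_1_10")


@dataclass
class PromptGraderConfig:
    """Hypothetical mirror of the grader's config fields (illustration only)."""

    prompt: Optional[str] = None   # None -> fall back to the default rubric
    model: Optional[str] = None    # None -> resolved via the judge model chain
    scoring: str = "scale_1_5"
    threshold: float = 0.5         # compared against the normalized score

    def validate(self) -> None:
        if self.scoring not in SCORING_SCALES:
            raise ValueError(f"unknown scoring scale: {self.scoring!r}")
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be within [0, 1]")


cfg = PromptGraderConfig(prompt="Did the agent produce working unit tests?")
cfg.validate()  # passes: every omitted field takes its documented default
```

Since every field is optional, a grader entry with only `type: prompt` would correspond to `PromptGraderConfig()` in this sketch.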
Scoring scales
| Scale | Range | Default threshold | When to use |
|---|---|---|---|
| `binary` | 0 or 1 | 0.5 | Simple yes/no judgments |
| `scale_1_5` | 1–5 | 0.5 (≈ 3/5) | General-purpose evaluation |
| `scale_1_10` | 1–10 | 0.5 (≈ 5.5/10) | Fine-grained quality assessment |
Scores are normalized to [0, 1] regardless of scale.
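A minimal Python sketch of the threshold check, assuming normalization divides the raw score by the scale maximum. That assumption mirrors the evidence format shown on this page (4.2/5 reported as 0.84); the actual implementation may normalize differently, for example by rescaling from the scale minimum. The function names are hypothetical.

```python
# Maximum raw score per scale, from the scoring scales table.
SCALE_MAX = {"binary": 1, "scale_1_5": 5, "scale_1_10": 10}


def normalize(raw_score: float, scoring: str) -> float:
    """Assumed normalization: raw score divided by the scale maximum."""
    return raw_score / SCALE_MAX[scoring]


def passes(raw_score: float, scoring: str = "scale_1_5", threshold: float = 0.5) -> bool:
    """Assumed pass rule: normalized score at or above the threshold."""
    return normalize(raw_score, scoring) >= threshold


print(round(normalize(4.2, "scale_1_5"), 2))  # 0.84
print(passes(2.1, "scale_1_5"))               # False
```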
How it works
- The trajectory is formatted into a readable timeline (tool calls, messages, outputs) with head/tail windowing for long trajectories
- A system prompt instructs the judge to evaluate against the rubric criteria
- The judge must call the `submit_grade` tool exactly once with per-criterion scores and reasoning
- If the judge fails to call the tool or sends invalid arguments, an in-session reminder nudges it to retry (up to 2 times)
- Per-criterion results are mapped to `GraderResult.details` sub-checks
- The overall score is normalized to `[0, 1]`
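The tool-call and reminder steps above can be sketched roughly as follows. This is a minimal Python sketch, not the grader's actual code: `ask_judge` is a hypothetical callable standing in for one judge turn, and the shape of the `submit_grade` arguments (a `criteria` mapping with `score` and `reasoning`) is an assumption.

```python
from typing import Callable, Optional

MAX_REMINDERS = 2  # "up to 2 times", per the steps above


def is_valid_grade(args: Optional[dict]) -> bool:
    """Check the assumed shape: {"criteria": {name: {"score", "reasoning"}}}."""
    if not isinstance(args, dict):
        return False
    criteria = args.get("criteria")
    if not isinstance(criteria, dict) or not criteria:
        return False
    return all(
        isinstance(c, dict) and "score" in c and "reasoning" in c
        for c in criteria.values()
    )


def collect_grade(ask_judge: Callable[[Optional[str]], Optional[dict]]) -> Optional[dict]:
    """Ask once, then send up to MAX_REMINDERS in-session reminders."""
    reminder = None
    for _ in range(1 + MAX_REMINDERS):
        args = ask_judge(reminder)  # None means the judge made no tool call
        if is_valid_grade(args):
            return args
        reminder = "Please call submit_grade exactly once with valid arguments."
    return None  # the judge never produced a usable grade


# Usage with a fake judge that fails once, then submits a grade.
replies = iter([None, {"criteria": {"correctness": {"score": 4, "reasoning": "ok"}}}])
grade = collect_grade(lambda reminder: next(replies))
print(grade["criteria"]["correctness"]["score"])  # 4
```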
Judge model resolution
The model used for judging follows this priority chain:
- Grader-level `config.model` in eval.yaml (most specific)
- `--judge-model` CLI flag
- `config.judge_model` in eval.yaml (global default for all LLM graders)
- Default: `claude-sonnet-4.6`
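The priority chain above can be sketched as a simple first-match lookup. This is a minimal Python illustration; `resolve_judge_model` and its parameter names are hypothetical, not the tool's API.

```python
from typing import Optional

DEFAULT_JUDGE_MODEL = "claude-sonnet-4.6"


def resolve_judge_model(
    grader_model: Optional[str] = None,        # grader-level config.model
    cli_judge_model: Optional[str] = None,     # --judge-model CLI flag
    global_judge_model: Optional[str] = None,  # top-level config.judge_model
) -> str:
    """Return the first value that is set, in priority order."""
    for candidate in (grader_model, cli_judge_model, global_judge_model):
        if candidate:
            return candidate
    return DEFAULT_JUDGE_MODEL


print(resolve_judge_model(grader_model="o3", global_judge_model="gpt-5.5"))  # o3
print(resolve_judge_model(global_judge_model="gpt-5.5"))                     # gpt-5.5
print(resolve_judge_model())                                                 # claude-sonnet-4.6
```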
Example eval.yaml:

```yaml
config:
  model: gpt-5.5         # agent execution model
  judge_model: gpt-5.5   # default judge model for all LLM graders

stimuli:
  - name: test-case
    graders:
      - type: prompt
        config:
          model: o3      # this specific grader uses o3
      - type: prompt     # this one uses gpt-5.5 (from judge_model)
```

Rubric and evaluation criteria
If you don’t provide a custom prompt, the default rubric evaluates:
- Task completion — Did the agent accomplish what was asked?
- Correctness — Is the output factually/technically correct?
- Quality — Is the output well-structured and clear?
Custom prompts let you focus on domain-specific criteria:
```yaml
graders:
  - type: prompt
    config:
      prompt: |
        Evaluate whether the agent's unit tests are comprehensive:
        1. Do the tests cover edge cases (empty input, nulls, errors)?
        2. Are assertions specific (not just "toBeTruthy")?
        3. Do the tests actually run (valid syntax, proper imports)?
```

Evidence examples

```text
✔ prompt  Score: 4.2/5 (0.84) — task_completion: 5/5, correctness: 4/5, quality: 4/5
✘ prompt  Score: 2.1/5 (0.42) — task_completion: 3/5, correctness: 1/5, quality: 2/5
```

Each criterion score is available as a sub-result in `details`:
```text
✘ prompt (score: 0.42)
  ✔ task_completion (5/5): Agent attempted all requested tasks
  ✘ correctness (1/5): Generated code has syntax errors and wrong imports
  ✘ quality (2/5): Output is disorganized with no explanation
```

Retry behavior
LLM calls can fail due to rate limits or transient errors. The prompt grader retries with exponential backoff:
- Up to 2 retries (3 total attempts)
- Exponential backoff with jitter (5s → 10s → 20s + random 0–1s)
- 10 minute total budget — won’t retry past this limit
- If all retries fail, the grader returns a failed result (score 0) with the error in evidence, rather than crashing the eval
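The retry policy above can be approximated as follows. This is a minimal Python sketch under stated assumptions: `call_with_retries`, its parameters, and the shape of the failed result are hypothetical, and the real grader may measure the time budget differently.

```python
import random
import time
from typing import Callable

MAX_RETRIES = 2      # 3 total attempts
BASE_DELAY_S = 5.0   # doubles on each retry: 5s, then 10s
BUDGET_S = 10 * 60   # 10 minute total budget


def call_with_retries(
    call: Callable[[], dict],
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Run `call`, retrying failures with exponential backoff and jitter."""
    start = clock()
    last_error = "unknown error"
    for attempt in range(1 + MAX_RETRIES):
        try:
            return call()
        except Exception as exc:  # rate limits, transient errors
            last_error = str(exc)
        if attempt == MAX_RETRIES:
            break  # no retries left
        delay = BASE_DELAY_S * (2 ** attempt) + random.uniform(0.0, 1.0)
        if clock() - start + delay > BUDGET_S:
            break  # the next retry would exceed the time budget
        sleep(delay)
    # Fail the grader instead of crashing the eval.
    return {"score": 0, "passed": False, "evidence": last_error}


# Usage: a call that always fails, with sleeping replaced by a recorder.
waits = []
result = call_with_retries(lambda: 1 / 0, sleep=waits.append)
print(result["score"], len(waits))  # 0 2
```

Passing `sleep` and `clock` as parameters is only a convenience for this sketch, so the policy can be exercised without actually waiting.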
Cost considerations
Every trial graded by a prompt grader makes at least one LLM API call. With multi-trial runs, the total grows multiplicatively:

```text
Cost ≈ (num_stimuli × runs × prompt_graders_per_stimulus) × per-call cost
```

For example, 5 stimuli × 5 runs × 1 prompt grader = 25 LLM judge calls per eval.
Use --judge-model to control costs: a smaller model for iteration, a larger model for final evaluation.
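To make the arithmetic concrete, here is a small Python sketch of the estimate. The per-call cost of 0.02 is a made-up placeholder, not a real price, and both helper functions are hypothetical.

```python
def estimate_judge_calls(num_stimuli: int, runs: int, prompt_graders_per_stimulus: int) -> int:
    """Minimum number of judge calls: one per trial per prompt grader."""
    return num_stimuli * runs * prompt_graders_per_stimulus


def estimate_judge_cost(calls: int, per_call_cost: float) -> float:
    """Lower-bound cost; any extra calls per trial would add to it."""
    return calls * per_call_cost


calls = estimate_judge_calls(num_stimuli=5, runs=5, prompt_graders_per_stimulus=1)
print(calls)                                       # 25
print(round(estimate_judge_cost(calls, 0.02), 2))  # 0.5
```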