Grader: panel (LLM panel of judges)
The panel grader sends the agent’s trajectory to N LLM judges in parallel and combines their normalized scores with a configurable aggregation strategy. Use it when a single judge is too noisy or biased and you want a more robust signal — for example, mixing models from different providers as a basic ensemble.
Taxonomy
Section titled “Taxonomy”| Property | Value |
|---|---|
| Determinism | llm |
| Cost | high (one LLM call per judge per stimulus) |
| Portability | t3a-scenario |
| Reference | reference-free |
| Temporal scope | trajectory-level |
| Score kind | llm |
Quick walkthrough — enable a panel grader
Section titled “Quick walkthrough — enable a panel grader”This walkthrough turns a working single-judge eval into a 3-model panel in three small steps. Start from any eval that already uses the prompt grader, or use the minimal example below.
1. Start with a single-judge baseline
Section titled “1. Start with a single-judge baseline”config: executor: copilot-sdk runs: 1
stimuli: - name: explain-binary-search prompt: "Explain how binary search works in 3 sentences." rubric: - "Mentions sorted input is required" - "Describes halving the search range" - "Stays at roughly 3 sentences" graders: - type: promptRun it once to confirm the agent and the single judge work end-to-end:
vally eval --eval-spec eval.yaml2. Swap prompt for panel and list the judges
Section titled “2. Swap prompt for panel and list the judges”Replace the single prompt grader with a panel and provide a models list. The panel reuses the stimulus rubric, so you don’t need to rewrite your criteria:
stimuli: - name: explain-binary-search prompt: "Explain how binary search works in 3 sentences." rubric: - "Mentions sorted input is required" - "Describes halving the search range" - "Stays at roughly 3 sentences" graders: - type: panel config: models: - claude-sonnet-4.6 - gpt-5.5 - gpt-5.5-miniThat’s the smallest valid panel configuration: just models. The grader defaults to aggregation: mean and the standard scale_1_5 scoring scale. Each judge receives the same trajectory, system prompt, and rubric.
3. Run the eval and read the panel result
Section titled “3. Run the eval and read the panel result”vally eval --eval-spec eval.yaml --verboseIn the console output you’ll see one top-level panel line plus a panel/<model> sub-result per judge:
✔ panel PASS via mean aggregation across 3 judge(s): claude-sonnet-4.6=0.80, gpt-5.5=0.60, gpt-5.5-mini=0.80 → 0.73 ✔ panel/claude-sonnet-4.6 Clear, mentions sorted input... ✔ panel/gpt-5.5 Concise but skips the sorted-input precondition. ✔ panel/gpt-5.5-mini Good explanation of halving...The aggregate score is what’s compared against threshold for pass/fail.
4. (Optional) Tighten the panel
Section titled “4. (Optional) Tighten the panel”Once the panel is wired up, dial in the behavior you want:
graders: - type: panel config: models: [claude-sonnet-4.6, gpt-5.5, gpt-5.5-mini] aggregation: majority # require a majority vote, not just a mean threshold: 0.6 # pass when aggregate ≥ 0.6 (normalized)That’s it — your eval now has a panel grader.
Config
Section titled “Config”graders: - type: panel config: models: [claude-sonnet-4.6, gpt-5.5, gpt-5.5-mini] aggregation: mean scoring: scale_1_5 threshold: 0.5 prompt: "Optional extra instructions for every judge."| Field | Type | Required | Default | Description |
|---|---|---|---|---|
models | string[] | Yes | — | One judge model per entry. Each model produces one independent verdict. |
aggregation | enum | No | "mean" | How to combine per-judge scores. See Aggregation strategies. |
scoring | "binary" | "scale_1_5" | "scale_1_10" | No | "scale_1_5" | Per-judge scoring scale (same options as the prompt grader). |
threshold | number | No | 0.5 (scoring-dependent) | Normalized [0, 1] cutoff. Aggregate score ≥ threshold ⇒ panel passes. Defaults to DEFAULT_THRESHOLDS[scoring] (currently 0.5 for every scale, but the per-scale default may change). |
prompt | string | No | (none) | Optional extra instructions appended to every judge’s prompt. The stimulus rubric is always included. |
Aggregation strategies
Section titled “Aggregation strategies”| Strategy | What it does | Result range | Use when |
|---|---|---|---|
mean | Arithmetic average of judge scores. | [0, 1] | You want a smooth, balanced consensus. Default. |
median | Middle judge score (or average of the two middles for even counts). | [0, 1] | You want robustness against one outlier judge. |
min | Lowest judge score. | [0, 1] | Strict: any single judge can fail the panel. |
majority | 1 when more than half the judges pass at threshold, 0 otherwise. Ties (common for even-sized panels) fall back to mean. | [0, 1] | You want a binary majority vote, e.g., LLM-as-jury. |
Judges that fail or return invalid responses are dropped before aggregation. The aggregate is always computed from surviving judges; the panel only fails outright when all judges fail.
Failure isolation
Section titled “Failure isolation”A panel never fails because of a single judge. When a judge throws (rate limit, timeout, malformed JSON, etc.):
- The judge is recorded in
metadata.failed_judgeswith the error message. - A failed
panel/<model>sub-result is added todetailsso reporters still surface it. - Aggregation runs over the remaining judges.
- If every judge fails, the panel returns
passed: false,score: 0, and the evidence lists each failure.
Result shape
Section titled “Result shape”Each panel grade exposes a top-level GraderResult plus per-judge sub-results in details:
{ "name": "panel", "kind": "llm", "passed": true, "score": 0.73, "evidence": "PASS via mean aggregation across 3 judge(s): ...", "details": [ { "name": "panel/claude-sonnet-4.6", "passed": true, "score": 0.8, "evidence": "Clear, covers sorted input...", "details": [ { "name": "panel/claude-sonnet-4.6/Mentions sorted input is required", "passed": true, "score": 1.0, "evidence": "...", }, // one per rubric criterion ], "metadata": { "model": "claude-sonnet-4.6", "token_usage": { /* ... */ }, }, }, // one per judge ], "metadata": { "aggregation": "mean", "threshold": 0.5, "scoring": "scale_1_5", "models": ["claude-sonnet-4.6", "gpt-5.5", "gpt-5.5-mini"], "disagreement": 0.2, // failed_judges is present only when at least one judge failed. },}Per-judge information (score, latency, token usage, reasoning) lives in details[*] rather than being duplicated into metadata. Walk details to inspect what each judge returned.
JSONL and JUnit reporters preserve the full details and metadata tree, so downstream tools have access to every per-judge score and reasoning trace.
Cost considerations
Section titled “Cost considerations”Every judge in models makes at least one LLM call per stimulus per trial:
Cost ≈ stimuli × runs × judges × per-call costTwo ways to keep cost in check while iterating:
- Mix tiers. Pair one strong judge (e.g.,
claude-sonnet-4.6) with one or two cheaper judges (e.g.,gpt-5.5-mini). - Keep
modelslean. Every entry inmodelsis one extra call per stimulus per trial; only add judges that earn their cost. Global LLM rate-limit / throughput control is owned by the runner andLlmClient, not by the panel grader itself.