
Grader: panel (LLM panel of judges)

The panel grader sends the agent’s trajectory to N LLM judges in parallel and combines their normalized scores with a configurable aggregation strategy. Use it when a single judge is too noisy or biased and you want a more robust signal — for example, mixing models from different providers as a basic ensemble.

| Property | Value |
| --- | --- |
| Determinism | llm |
| Cost | high (one LLM call per judge per stimulus) |
| Portability | t3a-scenario |
| Reference | reference-free |
| Temporal scope | trajectory-level |
| Score kind | llm |

Quick walkthrough — enable a panel grader


This walkthrough turns a working single-judge eval into a 3-model panel in three small steps. Start from any eval that already uses the prompt grader, or use the minimal example below.

eval.yaml
config:
  executor: copilot-sdk
  runs: 1
stimuli:
  - name: explain-binary-search
    prompt: "Explain how binary search works in 3 sentences."
    rubric:
      - "Mentions sorted input is required"
      - "Describes halving the search range"
      - "Stays at roughly 3 sentences"
    graders:
      - type: prompt

Run it once to confirm the agent and the single judge work end-to-end:

Terminal window
vally eval --eval-spec eval.yaml

2. Swap prompt for panel and list the judges


Replace the single prompt grader with a panel and provide a models list. The panel reuses the stimulus rubric, so you don’t need to rewrite your criteria:

eval.yaml
stimuli:
  - name: explain-binary-search
    prompt: "Explain how binary search works in 3 sentences."
    rubric:
      - "Mentions sorted input is required"
      - "Describes halving the search range"
      - "Stays at roughly 3 sentences"
    graders:
      - type: panel
        config:
          models:
            - claude-sonnet-4.6
            - gpt-5.5
            - gpt-5.5-mini

That’s the smallest valid panel configuration: just models. The grader defaults to aggregation: mean and the standard scale_1_5 scoring scale. Each judge receives the same trajectory, system prompt, and rubric.

Terminal window
vally eval --eval-spec eval.yaml --verbose

In the console output you’ll see one top-level panel line plus a panel/<model> sub-result per judge:

✔ panel PASS via mean aggregation across 3 judge(s):
claude-sonnet-4.6=0.80, gpt-5.5=0.60, gpt-5.5-mini=0.80 → 0.73
✔ panel/claude-sonnet-4.6 Clear, mentions sorted input...
✔ panel/gpt-5.5 Concise but skips the sorted-input precondition.
✔ panel/gpt-5.5-mini Good explanation of halving...

The aggregate score is what’s compared against threshold for pass/fail.
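As a quick check of the arithmetic in the sample output, the sketch below recomputes the aggregate and the pass/fail decision from the three per-judge scores. It assumes plain mean aggregation and the default threshold of 0.5; the variable names are illustrative and not part of vally.

```python
# Per-judge normalized scores copied from the sample console output.
judge_scores = {
    "claude-sonnet-4.6": 0.80,
    "gpt-5.5": 0.60,
    "gpt-5.5-mini": 0.80,
}
threshold = 0.5  # assumption: the documented default threshold

# Mean aggregation: arithmetic average of the judge scores.
aggregate = sum(judge_scores.values()) / len(judge_scores)

# The panel passes when the aggregate is at or above the threshold.
passed = aggregate >= threshold

print(f"aggregate={aggregate:.2f} passed={passed}")
```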

Once the panel is wired up, dial in the behavior you want:

graders:
  - type: panel
    config:
      models: [claude-sonnet-4.6, gpt-5.5, gpt-5.5-mini]
      aggregation: majority # require a majority vote, not just a mean
      threshold: 0.6 # pass when aggregate ≥ 0.6 (normalized)

That’s it — your eval now has a panel grader.

graders:
  - type: panel
    config:
      models: [claude-sonnet-4.6, gpt-5.5, gpt-5.5-mini]
      aggregation: mean
      scoring: scale_1_5
      threshold: 0.5
      prompt: "Optional extra instructions for every judge."

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| models | string[] | Yes | | One judge model per entry. Each model produces one independent verdict. |
| aggregation | enum | No | "mean" | How to combine per-judge scores. See Aggregation strategies. |
| scoring | "binary" \| "scale_1_5" \| "scale_1_10" | No | "scale_1_5" | Per-judge scoring scale (same options as the prompt grader). |
| threshold | number | No | 0.5 (scoring-dependent) | Normalized [0, 1] cutoff. Aggregate score ≥ threshold ⇒ panel passes. Defaults to DEFAULT_THRESHOLDS[scoring] (currently 0.5 for every scale, but the per-scale default may change). |
| prompt | string | No | (none) | Optional extra instructions appended to every judge’s prompt. The stimulus rubric is always included. |

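To make the defaults in the table concrete, here is a minimal sketch of how they could be resolved. It is not vally's implementation: `PanelConfig` and `AGGREGATIONS` are hypothetical names, the rejection of an empty `models` list is an assumption, and `DEFAULT_THRESHOLDS` simply mirrors the values the table describes.

```python
from dataclasses import dataclass
from typing import Optional

# Mirrors the documented per-scale defaults (currently 0.5 for every scale).
DEFAULT_THRESHOLDS = {"binary": 0.5, "scale_1_5": 0.5, "scale_1_10": 0.5}
AGGREGATIONS = {"mean", "median", "min", "majority"}  # hypothetical constant


@dataclass
class PanelConfig:  # hypothetical name, for illustration only
    models: list[str]
    aggregation: str = "mean"
    scoring: str = "scale_1_5"
    threshold: Optional[float] = None  # resolved from scoring when omitted
    prompt: Optional[str] = None

    def __post_init__(self) -> None:
        if not self.models:
            # Assumption: a panel needs at least one judge.
            raise ValueError("models must list at least one judge")
        if self.aggregation not in AGGREGATIONS:
            raise ValueError(f"unknown aggregation: {self.aggregation}")
        if self.scoring not in DEFAULT_THRESHOLDS:
            raise ValueError(f"unknown scoring: {self.scoring}")
        if self.threshold is None:
            self.threshold = DEFAULT_THRESHOLDS[self.scoring]
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be in [0, 1]")


# Smallest valid configuration: only models is given.
cfg = PanelConfig(models=["claude-sonnet-4.6", "gpt-5.5", "gpt-5.5-mini"])
print(cfg.aggregation, cfg.scoring, cfg.threshold)
```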
Aggregation strategies

| Strategy | What it does | Result range | Use when |
| --- | --- | --- | --- |
| mean | Arithmetic average of judge scores. | [0, 1] | You want a smooth, balanced consensus. Default. |
| median | Middle judge score (or average of the two middles for even counts). | [0, 1] | You want robustness against one outlier judge. |
| min | Lowest judge score. | [0, 1] | Strict: any single judge can fail the panel. |
| majority | 1 when more than half the judges pass at threshold, 0 otherwise. Ties (common for even-sized panels) fall back to mean. | [0, 1] | You want a binary majority vote, e.g., LLM-as-jury. |

Judges that fail or return invalid responses are dropped before aggregation. The aggregate is always computed from surviving judges; the panel only fails outright when all judges fail.
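The four strategies can be sketched in a few lines of Python. This is an illustration of the behavior described above, not the grader's source: the function name `aggregate` is hypothetical, and it assumes an individual judge "passes" when its normalized score is at or above the threshold.

```python
from statistics import mean, median


def aggregate(scores: list[float], strategy: str = "mean", threshold: float = 0.5) -> float:
    """Combine normalized [0, 1] scores from the surviving judges."""
    if not scores:
        raise ValueError("no surviving judges to aggregate")
    if strategy == "mean":
        return mean(scores)
    if strategy == "median":
        # statistics.median averages the two middle values for even counts.
        return median(scores)
    if strategy == "min":
        return min(scores)
    if strategy == "majority":
        # Assumption: a judge passes when its score >= threshold.
        passes = sum(1 for s in scores if s >= threshold)
        fails = len(scores) - passes
        if passes > fails:
            return 1.0
        if passes < fails:
            return 0.0
        return mean(scores)  # tie: fall back to the mean
    raise ValueError(f"unknown strategy: {strategy}")


scores = [0.80, 0.60, 0.80]
for strategy in ("mean", "median", "min", "majority"):
    print(strategy, round(aggregate(scores, strategy, threshold=0.6), 2))
```

With an even-sized panel such as `[0.8, 0.2]` and a threshold of 0.5, the majority vote ties and the sketch returns the mean instead.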

A panel never fails solely because one judge errors out. When a judge throws (rate limit, timeout, malformed JSON, etc.):

  • The judge is recorded in metadata.failed_judges with the error message.
  • A failed panel/<model> sub-result is added to details so reporters still surface it.
  • Aggregation runs over the remaining judges.
  • If every judge fails, the panel returns passed: false, score: 0, and the evidence lists each failure.
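The failure-handling rules above can be sketched with `asyncio.gather`. This is a simplified illustration under stated assumptions, not the grader's code: `run_panel` and the two stub judges are hypothetical, mean aggregation is hard-coded, and the result dictionary keeps only the fields needed to show the idea.

```python
import asyncio
from statistics import mean


async def run_panel(judges, threshold=0.5):
    """judges maps a model name to an async callable returning a normalized score."""
    names = list(judges)
    # return_exceptions=True keeps one failing judge from aborting the others.
    outcomes = await asyncio.gather(
        *(judges[name]() for name in names), return_exceptions=True
    )

    scores, failed_judges = {}, {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, Exception):
            failed_judges[name] = str(outcome)  # record the error message
        else:
            scores[name] = outcome

    if not scores:  # every judge failed
        return {"passed": False, "score": 0, "failed_judges": failed_judges}

    score = mean(scores.values())  # aggregate over surviving judges only
    result = {"passed": score >= threshold, "score": score}
    if failed_judges:  # present only when at least one judge failed
        result["failed_judges"] = failed_judges
    return result


async def ok_judge():  # hypothetical stub judge
    return 0.8


async def flaky_judge():  # hypothetical stub judge that always fails
    raise TimeoutError("judge timed out")


result = asyncio.run(run_panel({"judge-a": ok_judge, "judge-b": flaky_judge}))
print(result)
```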

Each panel grade exposes a top-level GraderResult plus per-judge sub-results in details:

{
  "name": "panel",
  "kind": "llm",
  "passed": true,
  "score": 0.73,
  "evidence": "PASS via mean aggregation across 3 judge(s): ...",
  "details": [
    {
      "name": "panel/claude-sonnet-4.6",
      "passed": true,
      "score": 0.8,
      "evidence": "Clear, covers sorted input...",
      "details": [
        {
          "name": "panel/claude-sonnet-4.6/Mentions sorted input is required",
          "passed": true,
          "score": 1.0,
          "evidence": "...",
        },
        // one per rubric criterion
      ],
      "metadata": {
        "model": "claude-sonnet-4.6",
        "token_usage": {
          /* ... */
        },
      },
    },
    // one per judge
  ],
  "metadata": {
    "aggregation": "mean",
    "threshold": 0.5,
    "scoring": "scale_1_5",
    "models": ["claude-sonnet-4.6", "gpt-5.5", "gpt-5.5-mini"],
    "disagreement": 0.2,
    // failed_judges is present only when at least one judge failed.
  },
}

Per-judge information (score, latency, token usage, reasoning) lives in details[*] rather than being duplicated into metadata. Walk details to inspect what each judge returned.
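As an example of walking details, the sketch below pulls one score per judge out of a parsed panel result. The helper name `judge_scores` is hypothetical, the sample data is a trimmed illustration, and the fallback to the panel/<model> name is an assumption for sub-results that lack metadata.model.

```python
def judge_scores(result: dict) -> dict:
    """Map each judge's model name to the score in its details sub-result."""
    scores = {}
    for judge in result.get("details", []):
        model = judge.get("metadata", {}).get("model")
        if model is None:
            # Assumption: fall back to the "panel/<model>" naming convention.
            model = judge["name"].split("/", 1)[-1]
        # .get() tolerates sub-results without a score field.
        scores[model] = judge.get("score")
    return scores


# Trimmed, illustrative data shaped like a panel result.
panel_result = {
    "name": "panel",
    "score": 0.73,
    "details": [
        {"name": "panel/claude-sonnet-4.6", "score": 0.8,
         "metadata": {"model": "claude-sonnet-4.6"}},
        {"name": "panel/gpt-5.5", "score": 0.6},
        {"name": "panel/gpt-5.5-mini", "score": 0.8,
         "metadata": {"model": "gpt-5.5-mini"}},
    ],
}

print(judge_scores(panel_result))
```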

JSONL and JUnit reporters preserve the full details and metadata tree, so downstream tools have access to every per-judge score and reasoning trace.

Every judge in models makes at least one LLM call per stimulus per trial:

Cost ≈ stimuli × runs × judges × per-call cost

Two ways to keep cost in check while iterating:

  1. Mix tiers. Pair one strong judge (e.g., claude-sonnet-4.6) with one or two cheaper judges (e.g., gpt-5.5-mini).
  2. Keep models lean. Every entry in models is one extra call per stimulus per trial; only add judges that earn their cost.

Global LLM rate limiting and throughput control are owned by the runner and LlmClient, not by the panel grader itself.
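A rough budget can be worked out from the cost formula before a run. The sketch below extends it to a mixed-tier panel by summing a per-call price for each judge. The function name and the prices are hypothetical placeholders, not real rates, and the result is a lower bound because each judge makes at least one call per stimulus per trial.

```python
def estimate_panel_cost(stimuli: int, runs: int, per_call_costs: dict) -> float:
    """Lower-bound cost: stimuli x runs x (sum of each judge's per-call cost)."""
    calls_per_judge = stimuli * runs
    return calls_per_judge * sum(per_call_costs.values())


# Hypothetical per-call prices for one strong judge and one cheaper judge.
prices = {"claude-sonnet-4.6": 0.02, "gpt-5.5-mini": 0.002}

stimuli, runs = 20, 3
total = estimate_panel_cost(stimuli, runs, prices)
total_calls = stimuli * runs * len(prices)

print(f"{total_calls} judge calls, about ${total:.2f}")
```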