Grader: panel (LLM panel of judges)

The panel grader sends the agent’s trajectory to N LLM judges in parallel and combines their normalized scores with a configurable aggregation strategy. Use it when a single judge is too noisy or biased and you want a more robust signal — for example, mixing models from different providers as a basic ensemble.

Taxonomy

| Property | Value |
| --- | --- |
| Determinism | `llm` |
| Cost | `high` (one LLM call per judge per stimulus) |
| Reference | `reference-free` |
| Temporal scope | `trajectory-level` |
| Score kind | `llm` |

Quick walkthrough — enable a panel grader

This walkthrough turns a working single-judge eval into a 3-model panel in three small steps, plus an optional fourth for tuning. Start from any eval that already uses the `prompt` grader, or use the minimal example below.

1. Start with a single-judge baseline

defaults:
  executor: copilot-sdk
  runs: 1

stimuli:
  - name: explain-binary-search
    prompt: "Explain how binary search works in 3 sentences."
    rubric:
      - "Mentions sorted input is required"
      - "Describes halving the search range"
      - "Stays at roughly 3 sentences"
    graders:
      - type: prompt

Run it once to confirm the agent and the single judge work end-to-end:

vally eval --eval-spec eval.yaml

2. Swap `prompt` for `panel` and list the judges

Replace the single `prompt` grader with a `panel` and provide a `models` list. The panel reuses the stimulus `rubric`, so you don’t need to rewrite your criteria:

stimuli:
  - name: explain-binary-search
    prompt: "Explain how binary search works in 3 sentences."
    rubric:
      - "Mentions sorted input is required"
      - "Describes halving the search range"
      - "Stays at roughly 3 sentences"
    graders:
      - type: panel
        config:
          models:
            - claude-sonnet-4.6
            - gpt-5.5
            - gpt-5.5-mini

That’s the smallest valid panel configuration: just `models`. The grader defaults to `aggregation: mean` and the standard `scale_1_5` scoring scale. Each judge receives the same trajectory, system prompt, and rubric.

3. Run the eval and read the panel result

vally eval --eval-spec eval.yaml --verbose

In the console output you’ll see one top-level `panel` line plus a `panel/<model>` sub-result per judge:

✔ panel  PASS via mean aggregation across 3 judge(s):
         claude-sonnet-4.6=0.80, gpt-5.5=0.60, gpt-5.5-mini=0.80 → 0.73
  ✔ panel/claude-sonnet-4.6  Clear, mentions sorted input...
  ✔ panel/gpt-5.5             Concise but skips the sorted-input precondition.
  ✔ panel/gpt-5.5-mini        Good explanation of halving...

The aggregate score is what’s compared against `threshold` for pass/fail.
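The arithmetic behind that summary line can be checked by hand. The sketch below is illustrative only: it copies the three normalized scores from the sample output and assumes the documented default `threshold` of `0.5`. It is not vally's own code.

```python
# Normalized per-judge scores copied from the sample console output.
judge_scores = {
    "claude-sonnet-4.6": 0.80,
    "gpt-5.5": 0.60,
    "gpt-5.5-mini": 0.80,
}
threshold = 0.5  # assumed: the documented default cutoff

# mean aggregation: plain arithmetic average of the judges' scores
aggregate = sum(judge_scores.values()) / len(judge_scores)
passed = aggregate >= threshold

print(f"{aggregate:.2f}", "PASS" if passed else "FAIL")
```

The average of 0.80, 0.60, and 0.80 rounds to 0.73, which clears the 0.5 cutoff, so the panel line reports PASS.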

4. (Optional) Tighten the panel

Once the panel is wired up, dial in the behavior you want:

graders:
  - type: panel
    config:
      models: [claude-sonnet-4.6, gpt-5.5, gpt-5.5-mini]
      aggregation: majority # decide by majority vote rather than by the mean
      threshold: 0.6 # normalized cutoff: a judge votes "pass" at a score ≥ 0.6

That’s it — your eval now has a panel grader.

Config

graders:
  - type: panel
    config:
      # Each entry is a bare model name or a { model, reasoning_effort } object.
      models:
        - { model: o3, reasoning_effort: high }
        - { model: claude-opus-4.6, reasoning_effort: medium }
        - gpt-5.5-mini # bare string → the model's default effort
      aggregation: mean
      scoring: scale_1_5
      threshold: 0.5
      prompt: "Optional extra instructions for every judge."

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `models` | (string \| object)[] | Yes | (required) | One judge per entry: either a bare model name or a `{ model, reasoning_effort }` object. See Reasoning effort. |
| `aggregation` | enum | No | `"mean"` | How to combine per-judge scores. See Aggregation strategies. |
| `scoring` | `"binary"` \| `"scale_1_5"` \| `"scale_1_10"` | No | `"scale_1_5"` | Per-judge scoring scale (same options as the `prompt` grader). |
| `threshold` | number | No | `0.5` (scoring-dependent) | Normalized `[0, 1]` cutoff. For most strategies the aggregate score ≥ threshold ⇒ panel passes; for `unanimous`, every judge’s score must individually meet this threshold. Defaults to `DEFAULT_THRESHOLDS[scoring]` (currently `0.5` for every scale, but the per-scale default may change). |
| `prompt` | string | No | (none) | Optional extra instructions appended to every judge’s prompt. The stimulus `rubric` is always included. |
| `criteria` | object[] | No | (none) | Structured rubric criteria enabling per-criterion aggregation. When set, the panel scores each criterion across judges and derives a weighted overall. |
| `overall_threshold` | number | No | `threshold` | Cutoff applied to the weighted overall when `criteria` is used. The panel passes only when every required criterion passes and the weighted overall ≥ this value. |

Reasoning effort

Set reasoning effort per judge by giving an entry the object form; leave others as bare strings to run at their default:

config:
  models:
    - { model: o3, reasoning_effort: high } # deliberates harder
    - { model: claude-opus-4.6, reasoning_effort: medium }
    - gpt-5.5-mini # no effort — model may not support it

Effort values are `low`, `medium`, `high`, or `xhigh`, and only apply to models whose capabilities support reasoning effort. Each judge’s effective effort is recorded as `reasoning_effort` in the metadata of its `panel/<label>` sub-result. `<label>` is usually the model name, but may be disambiguated as `model@<effort>` and/or `...#<n>` when the same model appears multiple times.
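To make the two entry forms concrete, here is a minimal sketch of how a `models` list could be normalized into `(model, reasoning_effort)` pairs. The helper `normalize_judge` and the choice to raise on an unknown effort value are assumptions for illustration, not vally's actual parsing logic.

```python
VALID_EFFORTS = {"low", "medium", "high", "xhigh"}

def normalize_judge(entry):
    """Return (model, reasoning_effort) for one `models` entry.

    Hypothetical helper: a bare string yields effort None, meaning
    "run at the model's default effort".
    """
    if isinstance(entry, str):
        return entry, None
    effort = entry.get("reasoning_effort")
    if effort is not None and effort not in VALID_EFFORTS:
        # Assumed behavior; the real tool may report this differently.
        raise ValueError(f"unknown reasoning_effort: {effort!r}")
    return entry["model"], effort

# The same three judges as in the config example, as parsed YAML.
models = [
    {"model": "o3", "reasoning_effort": "high"},
    {"model": "claude-opus-4.6", "reasoning_effort": "medium"},
    "gpt-5.5-mini",
]
judges = [normalize_judge(m) for m in models]
print(judges)
```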

Aggregation strategies

Every strategy operates only on surviving judges — judges that returned a valid score. Failed judges are dropped before aggregation (not counted as 0), so “every” and “majority” below always mean “of the judges that responded”. Note that a judge failing to respond doesn’t just get dropped from the score — it also forces the panel verdict to FAIL. See Judge-error handling for details.

| Strategy | What it does | Result range | Use when |
| --- | --- | --- | --- |
| `mean` | Arithmetic average of the surviving judges’ scores. | `[0, 1]` | You want a smooth, balanced consensus. Default. |
| `median` | Middle surviving-judge score (or average of the two middles for even counts). | `[0, 1]` | You want robustness against one outlier judge. |
| `min` | Lowest surviving-judge score. | `[0, 1]` | Strict: any single judge can fail the panel. |
| `majority` | `1` when more than half the surviving judges pass at `threshold`, `0` when fewer than half do. Ties (common for even-sized panels) fall back to `mean`. | `[0, 1]` | You want a binary majority vote, e.g., LLM-as-jury. |
| `unanimous` | Reports the mean of the surviving judges’ scores, but the consensus only passes when every surviving judge passes at `threshold`. (A crashed judge is excluded from the unanimity check, but still fails the panel overall; see Judge-error handling.) | `[0, 1]` | You want a smooth score with a strict all-pass verdict. |
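The strategies in the table can be sketched in a few lines of Python. This is an illustrative reading of the descriptions above, not vally's implementation: the function name `aggregate` and its `(score, passed)` return shape are assumptions, and `scores` is expected to contain only the surviving judges' normalized scores.

```python
from statistics import mean, median

def aggregate(scores, strategy="mean", threshold=0.5):
    """Return (score, passed) for surviving judges' normalized scores."""
    if not scores:
        return 0.0, False  # every judge failed
    votes = [s >= threshold for s in scores]  # per-judge pass/fail
    if strategy == "mean":
        score = mean(scores)
    elif strategy == "median":
        score = median(scores)
    elif strategy == "min":
        score = min(scores)
    elif strategy == "majority":
        yes, no = sum(votes), len(votes) - sum(votes)
        if yes == no:
            score = mean(scores)  # tie: fall back to the mean
        else:
            score = 1.0 if yes > no else 0.0
    elif strategy == "unanimous":
        # Reports the mean, but passes only if every judge passes.
        return mean(scores), all(votes)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return score, score >= threshold

panel = [0.80, 0.60, 0.80]
for name in ("mean", "median", "min", "majority", "unanimous"):
    print(name, aggregate(panel, name, threshold=0.7))
```

With a cutoff of 0.7, the same three scores pass under `mean`, `median`, and `majority` but fail under `min` and `unanimous`, because one judge scored 0.60.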

Per-criterion aggregation

By default the panel aggregates each judge’s single holistic `overall_score`. Supply a `criteria` list to instead aggregate each rubric criterion across the judges and roll the results up into a weighted overall, mirroring a structured, weighted rubric.

graders:
  - type: panel
    config:
      models: [claude-sonnet-4.6, gpt-5.5, gpt-5.5-mini]
      aggregation: mean
      overall_threshold: 0.7
      criteria:
        - name: Correctness
          description: The solution fully solves the task.
          weight: 3
          pass_threshold: 0.6
        - name: Clarity
          description: The output is well-structured and easy to follow.
          weight: 1

Each criterion accepts:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `name` | string | required | Identifier shown to judges and echoed back in per-criterion results. |
| `description` | string | (none) | Human-readable explanation shown to judges in the rubric. |
| `weight` | number | `1` | Relative weight in the overall score. Weights are normalized to sum to 1. A weight of `0` excludes that criterion from the overall (it can still gate via `required`). Setting every weight to `0` is rejected by `vally lint` as `invalid-grader-config`; if that check is bypassed, the runtime falls back to equal weighting. |
| `pass_threshold` | number | `threshold` | Per-criterion cutoff on the normalized `[0, 1]` scale. |
| `required` | boolean | `true` | When `false`, the criterion is advisory: still scored, weighted into the overall, and reported, but it never blocks the all-pass gate. |

How it works:

- The criteria become the rubric shown to every judge, so criterion names stay aligned across judges. Judge responses are matched to criteria by name (case-insensitive); duplicate names are rejected by `vally lint`.
- For each criterion, the surviving judges’ scores are aggregated with the configured `aggregation` strategy at that criterion’s `pass_threshold`.
- The overall score is the weighted sum of the aggregated criterion scores, using the normalized weights.
- The panel passes only when every required criterion passes and the weighted overall ≥ `overall_threshold`. Advisory criteria (`required: false`) are excluded from the all-pass gate but still contribute to the weighted overall. A required criterion that no judge scored fails.
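As a worked example of the roll-up, the sketch below applies `mean` aggregation to the two criteria from the config above. The per-judge scores are hypothetical, Clarity's `pass_threshold` is assumed to inherit the default `threshold` of `0.5`, and counting an unscored criterion as `0` in the overall is an assumption. The all-zero-weights fallback is not handled here.

```python
from statistics import mean

criteria = [
    {"name": "Correctness", "weight": 3, "pass_threshold": 0.6, "required": True},
    {"name": "Clarity", "weight": 1, "pass_threshold": 0.5, "required": True},
]
overall_threshold = 0.7

# Hypothetical normalized scores, one per surviving judge.
judge_scores = {
    "Correctness": [1.0, 0.75, 0.75],
    "Clarity": [0.5, 0.75, 0.5],
}

total_weight = sum(c["weight"] for c in criteria)
overall = 0.0
required_ok = True
for c in criteria:
    scores = judge_scores.get(c["name"], [])
    agg = mean(scores) if scores else 0.0  # assumed: unscored counts as 0
    ok = bool(scores) and agg >= c["pass_threshold"]
    if c["required"] and not ok:
        required_ok = False  # a required criterion blocks the panel
    overall += (c["weight"] / total_weight) * agg  # normalized weight

passed = required_ok and overall >= overall_threshold
print(f"overall={overall:.3f} passed={passed}")
```

Correctness aggregates to about 0.833 and Clarity to about 0.583, so with weights 3 and 1 the overall is 0.75 × 0.833 + 0.25 × 0.583 ≈ 0.771. Both criteria clear their own cutoffs and the overall clears 0.7, so this panel would pass.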

Aggregated per-criterion results appear in `details` as `panel/criterion/<name>` alongside the usual per-judge `panel/<model>` sub-results, and a compact summary is exposed in `metadata.criteria`. Each `panel/criterion/<name>` result also carries a structured `metadata.judge_scores` map (model → normalized score) so reporters can see exactly what each judge gave without parsing the evidence string.

Judge-error handling

Any judge that fails forces the whole panel to fail, regardless of the surviving judges’ consensus. When a judge throws (rate limit, timeout, malformed JSON, non-zero exit, etc.):

- The judge is recorded in `metadata.failed_judges` with the error message.
- A failed `panel/<model>` sub-result is added to `details` so reporters still surface it.
- The reported score is still computed over the surviving judges (so you can see how the responders scored), but `passed` is forced to `false` and the evidence is annotated with `— forced FAIL: N judge(s) errored`.
- If every judge fails, the panel returns `passed: false`, `score: 0`, and the evidence lists each failure.

This is a strict, all-or-nothing policy: a panel only passes when every configured judge responds and the consensus (or per-criterion) verdict passes. There is no partial-credit mode.
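The policy can be sketched as follows, using `mean` aggregation for simplicity. The function `grade_panel`, the use of exception objects to mark failed judges, and the exact shape of `failed_judges` (model name mapped to error message) are assumptions made for illustration.

```python
from statistics import mean

def grade_panel(outcomes, threshold=0.5):
    """outcomes maps model -> normalized score, or an Exception on failure."""
    failed = {m: str(o) for m, o in outcomes.items() if isinstance(o, Exception)}
    scores = [o for o in outcomes.values() if not isinstance(o, Exception)]

    # The score is still computed over the judges that responded.
    score = mean(scores) if scores else 0.0
    passed = bool(scores) and score >= threshold

    result = {"score": score, "passed": passed, "metadata": {}}
    if failed:
        result["passed"] = False  # forced FAIL, whatever the consensus
        result["metadata"]["failed_judges"] = failed
    return result

result = grade_panel({
    "claude-sonnet-4.6": 0.8,
    "gpt-5.5": TimeoutError("judge timed out"),
    "gpt-5.5-mini": 0.8,
})
print(result)
```

Here the two responding judges agree on a passing score, yet the panel still fails because one judge errored.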

Result shape

Each panel grade exposes a top-level GraderResult plus per-judge sub-results in details:

{
  "name": "panel",
  "kind": "llm",
  "passed": true,
  "score": 0.73,
  "evidence": "PASS via mean aggregation across 3 judge(s): ...",
  "details": [
    {
      "name": "panel/claude-sonnet-4.6",
      "passed": true,
      "score": 0.8,
      "evidence": "Clear, covers sorted input...",
      "details": [
        {
          "name": "panel/claude-sonnet-4.6/Mentions sorted input is required",
          "passed": true,
          "score": 1.0,
          "evidence": "...",
        },
        // one per rubric criterion
      ],
      "metadata": {
        "model": "claude-sonnet-4.6",
        "token_usage": {/* ... */},
      },
    },
    // one per judge
  ],
  "metadata": {
    "aggregation": "mean",
    "threshold": 0.5,
    "scoring": "scale_1_5",
    "models": ["claude-sonnet-4.6", "gpt-5.5", "gpt-5.5-mini"],
    "disagreement": 0.2,
    // failed_judges is present only when at least one judge failed.
  },
}

Per-judge information (score, latency, token usage, reasoning) lives in `details[*]` rather than being duplicated into `metadata`. Walk `details` to inspect what each judge returned.
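A short sketch of walking that tree, assuming a result has already been loaded into a Python dict. The literal below is a trimmed, hand-written copy of the example shape with scores taken from the walkthrough, not real tool output.

```python
result = {
    "name": "panel",
    "score": 0.73,
    "details": [
        {
            "name": "panel/claude-sonnet-4.6",
            "score": 0.8,
            "details": [
                {
                    "name": "panel/claude-sonnet-4.6/Mentions sorted input is required",
                    "score": 1.0,
                },
            ],
            "metadata": {"model": "claude-sonnet-4.6"},
        },
        {"name": "panel/gpt-5.5", "score": 0.6,
         "metadata": {"model": "gpt-5.5"}},
        {"name": "panel/gpt-5.5-mini", "score": 0.8,
         "metadata": {"model": "gpt-5.5-mini"}},
    ],
}

# Per-judge scores, keyed by model name.
per_judge = {j["metadata"]["model"]: j["score"] for j in result["details"]}

# Per-criterion scores for each judge, with the "panel/<model>/" prefix removed.
per_criterion = {
    j["metadata"]["model"]: {
        c["name"].removeprefix(j["name"] + "/"): c["score"]
        for c in j.get("details", [])
    }
    for j in result["details"]
}
print(per_judge)
print(per_criterion["claude-sonnet-4.6"])
```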

The `metadata` above is the holistic shape. In criteria mode the top-level `metadata` swaps `threshold` for `overall_threshold`, adds a `criteria` summary array, and drops top-level `disagreement`; each criterion’s spread is on its `panel/criterion/<name>` sub-result at `details[*].metadata.disagreement` instead.

JSONL and JUnit reporters preserve the full details and metadata tree, so downstream tools have access to every per-judge score and reasoning trace.

Cost considerations

Every judge in models makes at least one LLM call per stimulus per trial:

Cost ≈ stimuli × runs × judges × per-call cost
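The formula is easy to turn into a back-of-the-envelope estimate. In the sketch below the function names and the per-call price are hypothetical. The call count is a lower bound, since each judge makes at least one call.

```python
def judge_calls(stimuli, runs, judges):
    """Minimum number of judge LLM calls for one eval."""
    return stimuli * runs * judges

def estimated_cost(stimuli, runs, judges, per_call_cost):
    """Rough cost estimate; per_call_cost is whatever you pay per judge call."""
    return judge_calls(stimuli, runs, judges) * per_call_cost

# The walkthrough eval: 1 stimulus, runs: 1, 3 judges.
walkthrough_calls = judge_calls(stimuli=1, runs=1, judges=3)

# A larger suite: 50 stimuli, 3 runs each, same 3-judge panel.
suite_calls = judge_calls(stimuli=50, runs=3, judges=3)

# Hypothetical price of 0.01 per call, for illustration only.
suite_cost = estimated_cost(50, 3, 3, per_call_cost=0.01)

print(walkthrough_calls, suite_calls, round(suite_cost, 2))
```

The walkthrough eval needs only 3 judge calls, while the 50-stimulus suite needs 450. Dropping one judge from that suite would save 150 calls.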

Two ways to keep cost in check while iterating:

- Mix tiers. Pair one strong judge (e.g., `claude-sonnet-4.6`) with one or two cheaper judges (e.g., `gpt-5.5-mini`).
- Keep `models` lean. Every entry in `models` is one extra call per stimulus per trial; only add judges that earn their cost. Global LLM rate-limit / throughput control is owned by the runner and `LlmClient`, not by the panel grader itself.