Per-evaluative-dimension means by arm

4 axes · 1–5 rating · 4,220 ideas total (1,200 Baseline + 1,200 Shadow-Frog + 910 Human-rewritten)
Figure: mean score on each axis, per judge, for three arms: Baseline, Shadow-Frog, and Human (rewritten).
Shadow-Frog − Baseline deltas
Rubric definitions
Groundedness — Does the proposal demonstrate project-specific knowledge (real APIs, modules, conventions) vs. plausible-sounding generalities?
Insight — How unlikely is this idea to emerge from a 5-minute brainstorm by a regular contributor?
User Impact — How many real users would benefit and how meaningfully?
Spec Clarity — Could a maintainer turn this into a PR scope without back-and-forth?
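
The per-axis means and the Shadow-Frog − Baseline deltas reported above could be computed along the following lines. This is only a minimal sketch: the `(arm, axis, score)` record layout, the function names `mean_by_arm` and `arm_deltas`, and the toy ratings are all assumptions made for illustration, not the pipeline or the data behind the reported numbers.

```python
from collections import defaultdict
from statistics import fmean

# The four rubric axes named in the rubric definitions.
AXES = ("Groundedness", "Insight", "User Impact", "Spec Clarity")


def mean_by_arm(ratings):
    """Average scores per (arm, axis).

    `ratings` is an iterable of (arm, axis, score) tuples, with score on
    the 1-5 scale. This record layout is a hypothetical one.
    """
    buckets = defaultdict(list)
    for arm, axis, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        buckets[(arm, axis)].append(score)

    means = defaultdict(dict)
    for (arm, axis), scores in buckets.items():
        means[arm][axis] = fmean(scores)
    return dict(means)


def arm_deltas(means, treatment="Shadow-Frog", baseline="Baseline"):
    """Treatment minus baseline mean, for every axis both arms were rated on."""
    return {
        axis: means[treatment][axis] - means[baseline][axis]
        for axis in AXES
        if axis in means[treatment] and axis in means[baseline]
    }


# Toy ratings, invented purely to exercise the functions.
toy_ratings = [
    ("Baseline", "Insight", 2),
    ("Baseline", "Insight", 4),
    ("Shadow-Frog", "Insight", 4),
    ("Shadow-Frog", "Insight", 5),
]
toy_means = mean_by_arm(toy_ratings)
toy_deltas = arm_deltas(toy_means)
print(toy_deltas)  # {'Insight': 1.5}
```

To reproduce a per-judge view, the same functions can be applied to the subset of ratings belonging to one judge at a time.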