Three increasingly strict bug-identification metrics · 20 repos × 100 injected bugs each · n = 100 per cell · bands show 95% confidence intervals
Level 1 — file match
Did any finding mention the buggy file? (keyword match)
Level 2 — function match
Did any finding identify the buggy function? (keyword match)
Judge — verdict
Did an LLM judge rate the report as sufficient to fix the bug?
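The two keyword-match levels and the 95% bands can be illustrated with a minimal sketch. The legend does not say how the matching or the intervals were actually computed, so everything here is an assumption: the `InjectedBug` record and its field names are hypothetical, matching is a case-insensitive substring test, and the band is a Wilson score interval for a proportion.

```python
import math
from dataclasses import dataclass


@dataclass
class InjectedBug:
    # Hypothetical record for one injected bug; field names are assumptions.
    file_path: str
    function_name: str


def file_match(findings: list[str], bug: InjectedBug) -> bool:
    """Level 1: does any finding mention the buggy file (full path or basename)?"""
    full_path = bug.file_path.lower()
    basename = full_path.rsplit("/", 1)[-1]
    return any(full_path in f.lower() or basename in f.lower() for f in findings)


def function_match(findings: list[str], bug: InjectedBug) -> bool:
    """Level 2: does any finding mention the buggy function by name?"""
    name = bug.function_name.lower()
    return any(name in f.lower() for f in findings)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%).

    Assumed method; the figure does not state which interval it uses.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width
```

For one cell, a hit rate would be the number of bugs for which `file_match` (or `function_match`) returns `True`, divided by n = 100, with `wilson_interval(hits, 100)` giving the band. Plain substring matching can overcount when a function name is short or generic, which is one reason a judge-based verdict is the strictest of the three levels.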