Code Review
This category evaluates an agent’s ability to review a Business Central (AL) pull request. Given a diff, the agent produces structured review comments, which are scored against an expected (gold) set of findings.
Unlike the pass/fail categories, code review is scored with Precision / Recall / F1 over the matched comments. Expected and generated comments are paired by a globally optimal (one-to-one) assignment on file and line proximity (within a configured line tolerance), and a pair is only counted as matched when an LLM judge confirms the two describe the same underlying issue. Matched comments are additionally scored on how closely the agent’s severity classification tracks the expected severity.
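The pairing step can be illustrated with a minimal sketch. It assumes each comment reduces to a `(file, line)` tuple and that "globally optimal" means the most pairs first, then the smallest total line distance. The benchmark's actual cost function and tolerance are not stated here, so `optimal_pairs`, `confirmed_matches`, the `same_issue` callback (a stand-in for the LLM judge), and the file name are all hypothetical.

```python
from functools import lru_cache


def optimal_pairs(expected, generated, tolerance=5):
    """Pair expected and generated comments one-to-one.

    Each comment is a (file, line) tuple. A pair is eligible only when both
    comments are in the same file and at most `tolerance` lines apart.
    Assumed objective: most pairs first, then smallest total line distance.
    Exhaustive search with memoization, so only suitable for small inputs.
    """
    def distance(i, j):
        (e_file, e_line), (g_file, g_line) = expected[i], generated[j]
        if e_file != g_file or abs(e_line - g_line) > tolerance:
            return None  # not eligible
        return abs(e_line - g_line)

    @lru_cache(maxsize=None)
    def best(i, used):
        # Returns (pair_count, -total_distance, pairs) for expected[i:],
        # where `used` is a bitmask of generated comments already taken.
        if i == len(expected):
            return (0, 0, ())
        result = best(i + 1, used)  # option: leave expected[i] unmatched
        for j in range(len(generated)):
            if used & (1 << j):
                continue
            d = distance(i, j)
            if d is None:
                continue
            count, neg_dist, pairs = best(i + 1, used | (1 << j))
            candidate = (count + 1, neg_dist - d, ((i, j),) + pairs)
            if candidate[:2] > result[:2]:
                result = candidate
        return result

    return list(best(0, 0)[2])


def confirmed_matches(pairs, same_issue):
    """Keep only the pairs that the judge callback confirms."""
    return [(i, j) for i, j in pairs if same_issue(i, j)]


# Hypothetical data: nearest-first greedy pairing would take (12, 12) and
# leave line 10 unmatched; the global assignment pairs both.
expected = [("Sales.Codeunit.al", 10), ("Sales.Codeunit.al", 12)]
generated = [("Sales.Codeunit.al", 12), ("Sales.Codeunit.al", 14)]
pairs = optimal_pairs(expected, generated, tolerance=2)
```

Proximity only proposes candidate pairs. A pair still counts as matched only after the judge confirms it, which `confirmed_matches` models with a plain callback.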
Baseline Leaderboard
| Agent | Model | Micro F1 (95% CI) | Precision | Recall | Avg Time | Ver |
|---|---|---|---|---|---|---|
| GitHub Copilot | claude-opus-4-6 | 32.2% (30.6-33.3%) | 28.7% | 36.6% | 67.1s | 0.6.0 |
| GitHub Copilot | claude-opus-4-8 | 31.4% (29.3-32.9%) | 26.6% | 38.2% | 60.7s | 0.6.0 |
| Claude Code | claude-opus-4-8 | 29.3% (28.2-30.7%) | 22.6% | 41.8% | 57.2s | 0.6.0 |
| GitHub Copilot | claude-sonnet-4-6 | 27.3% (25.8-28.6%) | 20.9% | 39.4% | 83.2s | 0.6.0 |
| Claude Code | claude-sonnet-4-6 | 23.9% (23.2-24.5%) | 16.8% | 41.3% | 75.7s | 0.6.0 |
| GitHub Copilot | claude-opus-4-7 | 22.1% (21.6-23.4%) | 15.5% | 38.3% | 49.5s | 0.6.0 |
| GitHub Copilot | gpt-5-5 | 19.7% (18.9-21.0%) | 22.8% | 17.3% | 50.0s | 0.6.0 |
| GitHub Copilot | claude-haiku-4-5 | 19.7% (18.0-21.3%) | 18.8% | 20.6% | 51.5s | 0.6.0 |
Experiment Leaderboard
Compares review-knowledge configurations for the same model (see the Baseline Leaderboard above for the plain agent):
- Inline knowledge (pre-#8700) — the review checklists BCApps shipped inline before adopting BCQuality, injected as custom instructions.
No experiment results available yet. Check back soon!
How metrics are computed
- Precision — of the comments the agent generated, the fraction that matched an expected finding. Penalizes noisy reviews.
- Recall — of the expected findings, the fraction the agent caught. Penalizes missed issues.
- F1 — harmonic mean of precision and recall; balances both equally (the β=1 case of Fβ).
- Fβ (β=0.5) — precision-leaning F-score; use when false positives are costly (noisy reviews waste reviewer time).
- Fβ (β=2) — recall-leaning F-score; weights catching issues more than avoiding noise.
- Severity MAE — mean absolute error between the agent’s and the expected severity levels, over matched comments only. Lower is better; 0 means every matched comment got the severity exactly right.
- Valid output rate — fraction of tasks whose output parsed into a structured review. Failures score zero on every other metric. (Reported per run.)
- Micro vs. Macro — Micro sums matched/generated/expected across all tasks (tasks with many comments dominate); Macro averages per-task scores (every task counts equally).
- 95% CI — confidence interval bootstrapped over the per-task F1 scores, so the leaderboard reports sampling uncertainty even for a single run. The Micro F1 CI resamples runs; the Macro F1 CI resamples tasks.
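The definitions above can be sketched in a few lines. This is an illustrative sketch, not the benchmark's implementation: it assumes each task reduces to a `(matched, generated, expected)` count triple, that severities map to ordinal integers, and that the CI is a percentile bootstrap with 2000 resamples. All function names and the task counts are hypothetical; only the 28.7% / 36.6% figures come from the leaderboard's top row.

```python
import random


def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta < 1 leans toward precision, beta > 1 toward recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def f1_from_counts(matched, generated, expected):
    """F1 from raw counts; an empty denominator scores zero."""
    precision = matched / generated if generated else 0.0
    recall = matched / expected if expected else 0.0
    return f_beta(precision, recall)


def micro_f1(tasks):
    """Sum the counts across tasks, then score once."""
    matched, generated, expected = (sum(col) for col in zip(*tasks))
    return f1_from_counts(matched, generated, expected)


def macro_f1(tasks):
    """Score each task separately, then average."""
    return sum(f1_from_counts(*t) for t in tasks) / len(tasks)


def severity_mae(pairs):
    """Mean absolute error over (agent_level, expected_level) integer pairs."""
    return sum(abs(a - e) for a, e in pairs) / len(pairs) if pairs else None


def bootstrap_ci(tasks, stat, n_resamples=2000, seed=0):
    """Percentile 95% interval from resampling tasks with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        stat([rng.choice(tasks) for _ in tasks]) for _ in range(n_resamples)
    )
    return stats[int(0.025 * n_resamples)], stats[int(0.975 * n_resamples) - 1]


# Top leaderboard row: precision 28.7%, recall 36.6%.
top_f1 = f_beta(0.287, 0.366)

# Hypothetical per-task counts: (matched, generated, expected).
tasks = [(3, 10, 4), (1, 2, 5), (0, 3, 2)]
micro, macro = micro_f1(tasks), macro_f1(tasks)
low, high = bootstrap_ci(tasks, macro_f1)
```

With these counts the micro score comes out higher than the macro score, because the first task contributes most of the comments and scores well, while the macro average is pulled down by the task with no matches.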