Code Review

This category evaluates an agent’s ability to review a Business Central (AL) pull request. Given a diff, the agent produces structured review comments, which are scored against an expected (gold) set of findings.

Unlike the pass/fail categories, code review is scored with Precision / Recall / F1. Expected and generated comments are first paired by a globally optimal one-to-one assignment based on file and line proximity, within a configured line tolerance. A pair counts as matched only when an LLM judge confirms that both comments describe the same underlying issue. Matched comments are additionally scored on how closely the agent’s severity classification tracks the expected severity.
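
The pairing step described above can be sketched as follows. This is only an illustration under stated assumptions, not the benchmark's actual implementation: the `Comment` type, the function names `assign` and `match_comments`, the objective (most pairs first, then smallest total line distance), and the `judge` callable standing in for the LLM judge are all hypothetical. The exhaustive search is adequate for a handful of comments; a real harness would more likely use a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment`.

```python
from dataclasses import dataclass
from functools import lru_cache
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Comment:  # hypothetical shape of a review comment
    file: str
    line: int
    body: str


def assign(expected: List[Comment], generated: List[Comment],
           tolerance: int) -> List[Tuple[int, int]]:
    """One-to-one pairing of (expected index, generated index).

    Assumed objective: maximise the number of pairs, then minimise the
    total line distance. Pairs must share a file and lie within `tolerance`.
    """

    def distance(i: int, j: int):
        e, g = expected[i], generated[j]
        if e.file != g.file or abs(e.line - g.line) > tolerance:
            return None  # not a candidate pair
        return abs(e.line - g.line)

    @lru_cache(maxsize=None)
    def best(i: int, used: int):
        # Best (pair_count, -total_distance, pairs) for expected[i:],
        # where `used` is a bitmask of generated comments already taken.
        if i == len(expected):
            return (0, 0, ())
        result = best(i + 1, used)  # option: leave expected[i] unmatched
        for j in range(len(generated)):
            d = distance(i, j)
            if d is None or used & (1 << j):
                continue
            count, neg_dist, pairs = best(i + 1, used | (1 << j))
            candidate = (count + 1, neg_dist - d, ((i, j),) + pairs)
            if candidate[:2] > result[:2]:
                result = candidate
        return result

    return list(best(0, 0)[2])


def match_comments(expected: List[Comment], generated: List[Comment],
                   tolerance: int,
                   judge: Callable[[Comment, Comment], bool]) -> List[Tuple[int, int]]:
    """Keep only the assigned pairs that the judge confirms as the same issue."""
    pairs = assign(expected, generated, tolerance)
    return [(i, j) for i, j in pairs if judge(expected[i], generated[j])]


# Example where a nearest-first greedy pairing would find only one pair.
expected = [Comment("Sales.al", 10, "missing permission check"),
            Comment("Sales.al", 12, "unused variable")]
generated = [Comment("Sales.al", 11, "variable is never used"),
             Comment("Sales.al", 8, "permission check is missing")]
print(match_comments(expected, generated, 2, lambda e, g: True))
```

Separating the geometric assignment from the judge call keeps the number of judge invocations bounded by the number of assigned pairs, which is one plausible reason to order the two steps this way.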

Baseline Leaderboard

Agent	Model	Micro F1 (95% CI)	Precision	Recall	Avg Time	Ver
GitHub Copilot	claude-opus-4-6	32.2% (30.6-33.3%)	28.7%	36.6%	67.1s	0.6.0
GitHub Copilot	claude-opus-4-6	31.8% (29.2-33.0%)	26.4%	40.0%	51.0s	0.7.0
GitHub Copilot	claude-opus-4-8	31.4% (29.3-32.9%)	26.6%	38.2%	60.7s	0.6.0
GitHub Copilot	claude-opus-4-8	30.6% (29.3-32.6%)	24.4%	41.0%	62.5s	0.7.1
Claude Code	claude-opus-4-8	29.3% (28.2-30.7%)	22.6%	41.8%	57.2s	0.6.0
GitHub Copilot	claude-opus-4-8	28.6% (27.8-30.1%)	22.5%	39.4%	62.4s	0.7.0
GitHub Copilot	claude-sonnet-4-6	27.3% (25.8-28.6%)	20.9%	39.4%	83.2s	0.6.0
Claude Code	claude-sonnet-4-6	23.9% (23.2-24.5%)	16.8%	41.3%	75.7s	0.6.0
GitHub Copilot	claude-opus-4-7	22.1% (21.6-23.4%)	15.5%	38.3%	49.5s	0.6.0
GitHub Copilot	gpt-5-5	19.7% (18.9-21.0%)	22.8%	17.3%	50.0s	0.6.0
GitHub Copilot	claude-haiku-4-5	19.7% (18.0-21.3%)	18.8%	20.6%	51.5s	0.6.0

Experiment Leaderboard

Compares review-knowledge configurations for the same model (see the Baseline Leaderboard above for the plain agent):

Inline knowledge (pre-#8700) — the review checklists BCApps shipped inline before adopting BCQuality, injected as custom instructions.

Variant	Agent	Model	Micro F1 (95% CI)	Macro F1 (95% CI)	Precision	Recall	Avg Time	Ver
Inline knowledge (pre-#8700)	GitHub Copilot	gpt-5-5	53.2% (49.7-54.8%)	64.2% (60.8-67.4%)	52.0%	54.5%	72.6s	0.6.1
Inline knowledge (pre-#8700)	GitHub Copilot	claude-opus-4-8	53.1% (49.2-55.8%)	58.6% (55.4-61.6%)	42.9%	69.9%	138.8s	0.7.1
Inline knowledge (pre-#8700)	Claude Code	claude-opus-4-8	52.6% (51.5-53.6%)	62.5% (59.6-65.4%)	44.0%	65.5%	104.1s	0.6.1
Inline knowledge (pre-#8700)	GitHub Copilot	claude-opus-4-8	50.4% (48.9-52.8%)	59.9% (56.8-62.8%)	41.9%	63.2%	138.0s	0.6.1
Inline knowledge (pre-#8700)	GitHub Copilot	claude-opus-4-6	44.1% (42.6-46.1%)	50.1% (46.7-53.4%)	38.2%	52.1%	119.5s	0.6.1

How metrics are computed

Precision — of the comments the agent generated, the fraction that matched an expected finding. Penalizes noisy reviews.
Recall — of the expected findings, the fraction the agent caught. Penalizes missed issues.
F1 — harmonic mean of precision and recall; balances both equally (the β=1 case of Fβ).
Fβ (β=0.5) — precision-leaning F-score; use when false positives are costly (noisy reviews waste reviewer time).
Fβ (β=2) — recall-leaning F-score; weights catching issues more than avoiding noise.
Severity MAE — mean absolute error between the agent’s and the expected severity levels, over matched comments only. Lower is better; 0 means every matched comment got the severity exactly right.
Valid output rate — fraction of tasks whose output parsed into a structured review. Failures score zero on every other metric. (Reported per run.)
Micro vs. Macro — Micro sums matched/generated/expected across all tasks (tasks with many comments dominate); Macro averages per-task scores (every task counts equally).
95% CI — confidence interval bootstrapped over the per-task F1 scores, so the leaderboard reports sampling uncertainty even for a single run. The Micro F1 CI resamples runs; the Macro F1 CI resamples tasks.
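
The definitions above can be made concrete with a short sketch. It assumes each task is summarised by three counts (matched, generated, expected), that severities are mapped to integer levels, and that the interval is a percentile bootstrap that resamples tasks. The names `TaskCounts`, `f_beta`, `micro_f1`, `macro_f1`, `severity_mae` and `bootstrap_ci`, the resample count, and the seed are illustrative choices, not the benchmark's own code.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class TaskCounts:  # hypothetical per-task summary
    matched: int
    generated: int
    expected: int


def f_beta(matched: int, generated: int, expected: int, beta: float = 1.0) -> float:
    """F-beta from raw counts; beta=1 gives F1, 0.5 leans to precision, 2 to recall."""
    precision = matched / generated if generated else 0.0
    recall = matched / expected if expected else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def micro_f1(tasks: Sequence[TaskCounts]) -> float:
    """Sum the counts across tasks first, then compute one F1."""
    return f_beta(sum(t.matched for t in tasks),
                  sum(t.generated for t in tasks),
                  sum(t.expected for t in tasks))


def macro_f1(tasks: Sequence[TaskCounts]) -> float:
    """Compute F1 per task, then average, so every task counts equally."""
    return sum(f_beta(t.matched, t.generated, t.expected) for t in tasks) / len(tasks)


def severity_mae(levels: Sequence[Tuple[int, int]]) -> float:
    """Mean absolute error over (agent_level, expected_level) pairs of matched comments.

    Assumes a non-empty list and an integer encoding of severities.
    """
    return sum(abs(a - e) for a, e in levels) / len(levels)


def bootstrap_ci(tasks: List[TaskCounts],
                 statistic: Callable[[Sequence[TaskCounts]], float],
                 n_resamples: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """Percentile 95% interval from resampling tasks with replacement."""
    rng = random.Random(seed)
    stats = sorted(statistic(rng.choices(tasks, k=len(tasks)))
                   for _ in range(n_resamples))
    return stats[int(0.025 * n_resamples)], stats[int(0.975 * n_resamples) - 1]


tasks = [TaskCounts(2, 4, 2), TaskCounts(0, 3, 1), TaskCounts(1, 1, 4)]
print(f"micro F1 = {micro_f1(tasks):.3f}, macro F1 = {macro_f1(tasks):.3f}")
print("macro F1 95% CI:", bootstrap_ci(tasks, macro_f1))
```

With these three toy tasks the Micro and Macro scores already differ, because the task with the most generated comments carries more weight in the summed counts than in the per-task average.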

Code Review - BC-Bench

Inspired by SWE-Bench, for the Business Central (AL) ecosystem.