Code Review

This category evaluates an agent’s ability to review a Business Central (AL) pull request. Given a diff, the agent produces structured review comments, which are scored against an expected (gold) set of findings.

Unlike the pass/fail categories, code review is scored with precision, recall, and F1 over matched comments. Expected and generated comments are first paired by a globally optimal one-to-one assignment based on file and line proximity (within a configured line tolerance). A pair counts as matched only when an LLM judge confirms that the two comments describe the same underlying issue. Matched comments are additionally scored on how closely the agent’s severity classification tracks the expected severity.
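The matching and scoring steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual implementation: the `Comment` type, `assign_pairs`, `micro_prf`, the `issue` label, and the equality-based judge are all hypothetical stand-ins. The sketch also assumes that the judge runs after the assignment, and that "globally optimal" means maximising the number of pairs and then minimising total line distance.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Comment:
    file: str
    line: int
    issue: str  # hypothetical short label standing in for the free-text comment


def assign_pairs(expected, generated, line_tolerance):
    """One-to-one pairing: maximise pair count, then minimise total line distance."""
    n, m = len(expected), len(generated)

    def distance(i, j):
        e, g = expected[i], generated[j]
        if e.file != g.file:
            return None  # comments on different files are never paired
        d = abs(e.line - g.line)
        return d if d <= line_tolerance else None

    @lru_cache(maxsize=None)
    def best(i, used):
        # Best (pair_count, -total_distance, pairs) for expected[i:],
        # where `used` is a bitmask of generated comments already taken.
        if i == n:
            return (0, 0, ())
        result = best(i + 1, used)  # option: leave expected[i] unpaired
        for j in range(m):
            d = distance(i, j)
            if d is None or used & (1 << j):
                continue
            count, neg_dist, pairs = best(i + 1, used | (1 << j))
            candidate = (count + 1, neg_dist - d, ((i, j),) + pairs)
            if candidate[:2] > result[:2]:
                result = candidate
        return result

    return list(best(0, 0)[2])


def micro_prf(counts):
    """counts: (matched, n_generated, n_expected) per task, pooled before dividing."""
    counts = list(counts)
    tp = sum(c[0] for c in counts)
    n_gen = sum(c[1] for c in counts)
    n_exp = sum(c[2] for c in counts)
    precision = tp / n_gen if n_gen else 0.0
    recall = tp / n_exp if n_exp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data: a nearest-first greedy pairing would take (expected[1], generated[0])
# and leave expected[0] unpaired; the optimal assignment pairs both.
expected = [Comment("Sales.al", 10, "A"), Comment("Sales.al", 14, "B")]
generated = [
    Comment("Sales.al", 13, "A"),
    Comment("Sales.al", 17, "X"),
    Comment("Post.al", 14, "C"),
]

pairs = assign_pairs(expected, generated, line_tolerance=3)

# Stand-in for the LLM judge: the real judge compares comment text.
matched = [(i, j) for i, j in pairs if expected[i].issue == generated[j].issue]

precision, recall, f1 = micro_prf([(len(matched), len(generated), len(expected))])
print(pairs, matched, round(precision, 3), round(recall, 3), round(f1, 3))
```

The bitmask search is exponential in the number of generated comments, so it only suits small review sets; a production scorer would more plausibly use a polynomial-time assignment solver such as SciPy's `linear_sum_assignment`. The F1 definition used here (harmonic mean of precision and recall) is consistent with the leaderboard figures: for the first row, 2 × 0.287 × 0.366 / (0.287 + 0.366) ≈ 0.322.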

Baseline Leaderboard

| Agent | Model | Micro F1 (95% CI) | Precision | Recall | Avg Time | Ver |
|---|---|---|---|---|---|---|
| GitHub Copilot | claude-opus-4-6 | 32.2% (30.6-33.3%) | 28.7% | 36.6% | 67.1s | 0.6.0 |
| GitHub Copilot | claude-opus-4-8 | 31.4% (29.3-32.9%) | 26.6% | 38.2% | 60.7s | 0.6.0 |
| Claude Code | claude-opus-4-8 | 29.3% (28.2-30.7%) | 22.6% | 41.8% | 57.2s | 0.6.0 |
| GitHub Copilot | claude-sonnet-4-6 | 27.3% (25.8-28.6%) | 20.9% | 39.4% | 83.2s | 0.6.0 |
| Claude Code | claude-sonnet-4-6 | 23.9% (23.2-24.5%) | 16.8% | 41.3% | 75.7s | 0.6.0 |
| GitHub Copilot | claude-opus-4-7 | 22.1% (21.6-23.4%) | 15.5% | 38.3% | 49.5s | 0.6.0 |
| GitHub Copilot | gpt-5-5 | 19.7% (18.9-21.0%) | 22.8% | 17.3% | 50.0s | 0.6.0 |
| GitHub Copilot | claude-haiku-4-5 | 19.7% (18.0-21.3%) | 18.8% | 20.6% | 51.5s | 0.6.0 |

Experiment Leaderboard

This leaderboard compares review-knowledge configurations for the same model. See the Baseline Leaderboard above for the plain agent.

No experiment results available yet. Check back soon!

How metrics are computed
