
Bug Fixing

This category follows the SWE-Bench methodology. The system is tasked with fixing a bug in the Business Central (AL) codebase given an issue description.

Baseline Leaderboard

| Agent | Model | mean (95% CI) | pass^5 | Avg Time | Ver |
| --- | --- | --- | --- | --- | --- |
| Claude Code | claude-opus-4-6 | 68.5% (65.0-71.3%) | 49.5% | 284.0s | 0.2.0 |
| GitHub Copilot | claude-sonnet-4-6 | 67.3% (66.1-68.5%) | 48.5% | 510.5s | 0.4.0 |
| GitHub Copilot | claude-opus-4-6 | 65.1% (63.0-67.7%) | 50.5% | 313.9s | 0.2.0 |
| GitHub Copilot | gpt-5-2-codex | 60.8% (59.2-62.0%) | 49.5% | 195.5s | 0.2.2 |
| GitHub Copilot | claude-opus-4-5 | 59.8% (58.2-61.2%) | 38.6% | 172.0s | 0.2.0 |
| GitHub Copilot | gpt-5-4 | 58.4% (55.8-60.8%) | 37.6% | 314.3s | 0.3.1 |
| GitHub Copilot | claude-opus-4-5 | 58.4% (56.6-60.2%) | 38.6% | 164.7s | 0.1.0 |
| Claude Code | claude-opus-4-5 | 57.4% (54.9-59.2%) | 31.7% | 204.6s | 0.1.0 |
| GitHub Copilot | gpt-5-3-codex | 55.8% (54.1-56.8%) | 37.6% | 106.6s | 0.2.1 |
| GitHub Copilot | gpt-5-1-codex-max | 53.7% (51.7-56.8%) | 36.6% | 229.2s | 0.2.2 |
| GitHub Copilot | gpt-4-1 | 16.6% (15.6-17.2%) | 5.0% | 255.8s | 0.2.2 |
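
The page does not define the mean and pass^5 columns, so the following is only a sketch of one common reading. It assumes each task is attempted in 5 independent runs, that the mean is the per-run pass rate averaged over runs, and that pass^5 is the share of tasks solved in all 5 runs. The function name `summarize`, the input layout, and the task ids are hypothetical and not taken from the benchmark's code.

```python
from statistics import mean


def summarize(runs):
    """Summarize pass/fail outcomes under the assumptions stated above.

    runs: dict mapping a task id to a list of booleans, one per run.
    Every task is assumed to have the same number of runs (k).
    Returns (mean_pass_rate, pass_k), both as fractions in [0, 1].
    """
    k = len(next(iter(runs.values())))
    # Pass rate of each run: fraction of tasks solved in run i.
    per_run = [
        mean(1.0 if outcomes[i] else 0.0 for outcomes in runs.values())
        for i in range(k)
    ]
    mean_pass_rate = mean(per_run)
    # pass^k (assumed definition): fraction of tasks solved in every run.
    pass_k = mean(1.0 if all(outcomes) else 0.0 for outcomes in runs.values())
    return mean_pass_rate, pass_k


# Made-up outcomes for four hypothetical tasks, five runs each.
example = {
    "task-a": [True, True, True, True, True],
    "task-b": [True, True, False, True, True],
    "task-c": [False, False, False, False, False],
    "task-d": [True, True, True, True, True],
}
mean_rate, pass_5 = summarize(example)
print(f"mean={mean_rate:.1%} pass^5={pass_5:.1%}")
```

Under this reading, pass^5 can never exceed the mean, because a task that fails even once still contributes to the mean but not to pass^5. That is consistent with every row of the leaderboard, though it does not confirm the benchmark computes the columns this way.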
