Bug Fixing
This category follows the SWE-bench methodology: the system is given an issue description and must fix the corresponding bug in the Business Central (AL) codebase.
Baseline Leaderboard
| Agent | Model | Mean (95% CI) | pass^5 | Avg Time | Version |
|---|---|---|---|---|---|
| Claude Code | claude-opus-4-6 | 68.5% (65.0-71.3%) | 49.5% | 284.0s | 0.2.0 |
| GitHub Copilot | claude-sonnet-4-6 | 67.3% (66.1-68.5%) | 48.5% | 510.5s | 0.4.0 |
| GitHub Copilot | claude-opus-4-6 | 65.1% (63.0-67.7%) | 50.5% | 313.9s | 0.2.0 |
| GitHub Copilot | gpt-5-2-codex | 60.8% (59.2-62.0%) | 49.5% | 195.5s | 0.2.2 |
| GitHub Copilot | claude-opus-4-5 | 59.8% (58.2-61.2%) | 38.6% | 172.0s | 0.2.0 |
| GitHub Copilot | gpt-5-4 | 58.4% (55.8-60.8%) | 37.6% | 314.3s | 0.3.1 |
| GitHub Copilot | claude-opus-4-5 | 58.4% (56.6-60.2%) | 38.6% | 164.7s | 0.1.0 |
| Claude Code | claude-opus-4-5 | 57.4% (54.9-59.2%) | 31.7% | 204.6s | 0.1.0 |
| GitHub Copilot | gpt-5-3-codex | 55.8% (54.1-56.8%) | 37.6% | 106.6s | 0.2.1 |
| GitHub Copilot | gpt-5-1-codex-max | 53.7% (51.7-56.8%) | 36.6% | 229.2s | 0.2.2 |
| GitHub Copilot | gpt-4-1 | 16.6% (15.6-17.2%) | 5.0% | 255.8s | 0.2.2 |
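
The leaderboard does not spell out how its columns are computed, so the following is only a minimal sketch under stated assumptions: that each task is attempted five times, that the mean column is the fraction of all attempts that pass, and that pass^5 is the fraction of tasks that pass in every one of the five attempts. The function names and the `results` layout are hypothetical, and the confidence interval is left out because its method is not given here.

```python
from typing import Dict, List


def mean_pass_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of all attempts (across every task) that passed."""
    total_attempts = sum(len(runs) for runs in results.values())
    passed_attempts = sum(sum(runs) for runs in results.values())
    return passed_attempts / total_attempts


def pass_hat_k(results: Dict[str, List[bool]], k: int = 5) -> float:
    """Fraction of tasks whose first k attempts ALL passed (assumed pass^k)."""
    consistent = sum(
        1 for runs in results.values() if len(runs) >= k and all(runs[:k])
    )
    return consistent / len(results)


# Hypothetical outcomes: task id -> pass/fail for each of five attempts.
results = {
    "task-a": [True, True, True, True, True],
    "task-b": [True, True, False, True, True],
    "task-c": [False, False, False, False, False],
    "task-d": [True, True, True, True, True],
}

print(f"mean:   {mean_pass_rate(results):.1%}")
print(f"pass^5: {pass_hat_k(results, k=5):.1%}")
```

Under these assumptions, a single failed attempt leaves the mean nearly unchanged but removes the task from pass^5 entirely, which would explain why the pass^5 column sits well below the mean for every row.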