
Bug Fixing

This category follows the SWE-bench methodology: given an issue description, the system is tasked with fixing the corresponding bug in a Business Central (AL) codebase.

Baseline Leaderboard

| Agent | Model | mean (95% CI) | pass^5 | Avg Time | Ver |
|---|---|---|---|---|---|
| GitHub Copilot | claude-sonnet-4-6 | 67.3% (66.1-68.5%) | 48.5% | 510.5s | 0.4.0 |
| GitHub Copilot | claude-opus-4-6 | 66.9% (64.6-69.7%) | 45.5% | 477.9s | 0.5.0 |
| GitHub Copilot | claude-opus-4-7 | 65.9% (64.4-67.5%) | 50.5% | 245.2s | 0.5.1 |
| Claude Code | claude-opus-4-6 | 65.7% (64.4-67.1%) | 45.5% | 219.1s | 0.5.0 |
| GitHub Copilot | gpt-5-2-codex | 60.8% (59.2-62.0%) | 49.5% | 195.5s | 0.2.2 |
| GitHub Copilot | claude-opus-4-5 | 59.8% (58.2-61.2%) | 38.6% | 172.0s | 0.2.0 |
| GitHub Copilot | gpt-5-4 | 58.4% (55.8-60.8%) | 37.6% | 314.3s | 0.3.1 |
| GitHub Copilot | claude-opus-4-5 | 58.4% (56.6-60.2%) | 38.6% | 164.7s | 0.1.0 |
| Claude Code | claude-opus-4-5 | 57.4% (54.9-59.2%) | 31.7% | 204.6s | 0.1.0 |
| GitHub Copilot | gpt-5-3-codex | 55.8% (54.1-56.8%) | 37.6% | 106.6s | 0.2.1 |
| GitHub Copilot | gpt-5-1-codex-max | 53.7% (51.7-56.8%) | 36.6% | 229.2s | 0.2.2 |
| GitHub Copilot | gpt-4-1 | 16.6% (15.6-17.2%) | 5.0% | 255.8s | 0.2.2 |
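The metrics above can be illustrated with a small sketch. This assumes the common SWE-bench-style conventions: each task is attempted in five independent runs, "mean" averages success over all runs (with a bootstrap 95% confidence interval), and "pass^5" is the fraction of tasks solved in all five runs. The task names and outcomes below are purely illustrative, not the benchmark's actual data.

```python
import random
from statistics import mean

# Hypothetical per-task outcomes: 5 independent runs per task,
# True = the generated fix passed the task's tests.
# (Illustrative data only -- not the benchmark's actual results.)
runs = {
    "task-001": [True, True, True, True, True],
    "task-002": [True, False, True, True, True],
    "task-003": [False, False, False, True, False],
}

# Mean pass rate: average success over every individual run.
pass_rate = mean(r for results in runs.values() for r in results)

# pass^5: fraction of tasks solved in ALL five runs -- a stricter,
# consistency-oriented metric than the mean.
pass_pow_5 = mean(all(results) for results in runs.values())

# 95% confidence interval for the mean via a simple bootstrap,
# resampling tasks with replacement (one plausible way to produce
# the CI column; the benchmark's exact procedure may differ).
def bootstrap_ci(task_results, n_boot=10_000, seed=0):
    rng = random.Random(seed)
    tasks = list(task_results.values())
    means = sorted(
        mean(r for t in rng.choices(tasks, k=len(tasks)) for r in t)
        for _ in range(n_boot)
    )
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]

lo, hi = bootstrap_ci(runs)
print(f"mean   = {pass_rate:.1%} ({lo:.1%}-{hi:.1%})")
print(f"pass^5 = {pass_pow_5:.1%}")
```

With the toy data above, the mean is 66.7% (10 of 15 runs pass) while pass^5 is only 33.3% (1 of 3 tasks passes every run), which mirrors the gap between the two columns in the leaderboard.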

MCP Server Experimental Configurations

Comparison of experimental MCP server configurations for GitHub Copilot with claude-opus-4-6.

| MCP Servers | mean (95% CI) | pass^5 | Avg Time | Ver |
|---|---|---|---|---|
| altool | 71.3% (70.1-74.1%) | 54.5% | 570.7s | 0.5.0 |
| None | 66.9% (64.6-69.7%) | 45.5% | 477.9s | 0.5.0 |
