
Bug Fixing

This category follows the SWE-bench methodology: given an issue description, the system is tasked with fixing the corresponding bug in a Business Central (AL) codebase.

Baseline Leaderboard

| Agent | Model | mean (95% CI) | pass^5 | Avg Time | Ver |
|---|---|---|---|---|---|
| GitHub Copilot | claude-sonnet-4-6 | 67.3% (66.1-68.5%) | 48.5% | 510.5s | 0.4.0 |
| GitHub Copilot | claude-opus-4-6 | 66.9% (64.6-69.7%) | 45.5% | 477.9s | 0.5.0 |
| GitHub Copilot | claude-opus-4-7 | 65.9% (64.4-67.5%) | 50.5% | 245.2s | 0.5.1 |
| Claude Code | claude-opus-4-6 | 65.7% (64.4-67.1%) | 45.5% | 219.1s | 0.5.0 |
| GitHub Copilot | gpt-5-2-codex | 60.8% (59.2-62.0%) | 49.5% | 195.5s | 0.2.2 |
| GitHub Copilot | claude-opus-4-5 | 59.8% (58.2-61.2%) | 38.6% | 172.0s | 0.2.0 |
| GitHub Copilot | gpt-5-4 | 58.4% (55.8-60.8%) | 37.6% | 314.3s | 0.3.1 |
| GitHub Copilot | claude-opus-4-5 | 58.4% (56.6-60.2%) | 38.6% | 164.7s | 0.1.0 |
| Claude Code | claude-opus-4-5 | 57.4% (54.9-59.2%) | 31.7% | 204.6s | 0.1.0 |
| GitHub Copilot | gpt-5-3-codex | 55.8% (54.1-56.8%) | 37.6% | 106.6s | 0.2.1 |
| GitHub Copilot | gpt-5-1-codex-max | 53.7% (51.7-56.8%) | 36.6% | 229.2s | 0.2.2 |
| GitHub Copilot | gpt-4-1 | 16.6% (15.6-17.2%) | 5.0% | 255.8s | 0.2.2 |
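The metrics above can be illustrated with a small sketch. This assumes the common SWE-bench-style conventions: each task is attempted in five independent runs, "mean" averages success over all runs (with a bootstrap 95% confidence interval), and "pass^5" is the fraction of tasks solved in all five runs. The task names and outcomes below are purely illustrative, not the benchmark's actual data.

```python
import random
from statistics import mean

# Hypothetical per-task outcomes: 5 independent runs per task,
# True = the generated fix passed the task's tests.
# (Illustrative data only -- not the benchmark's actual results.)
runs = {
    "task-001": [True, True, True, True, True],
    "task-002": [True, False, True, True, True],
    "task-003": [False, False, False, True, False],
}

# Mean pass rate: average success over every individual run.
pass_rate = mean(r for results in runs.values() for r in results)

# pass^5: fraction of tasks solved in ALL five runs -- a stricter,
# consistency-oriented metric than the mean.
pass_pow_5 = mean(all(results) for results in runs.values())

# 95% confidence interval for the mean via a simple bootstrap,
# resampling tasks with replacement (one plausible way to produce
# the CI column; the benchmark's exact procedure may differ).
def bootstrap_ci(task_results, n_boot=10_000, seed=0):
    rng = random.Random(seed)
    tasks = list(task_results.values())
    means = sorted(
        mean(r for t in rng.choices(tasks, k=len(tasks)) for r in t)
        for _ in range(n_boot)
    )
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]

lo, hi = bootstrap_ci(runs)
print(f"mean   = {pass_rate:.1%} ({lo:.1%}-{hi:.1%})")
print(f"pass^5 = {pass_pow_5:.1%}")
```

With the toy data above, the mean is 66.7% (10 of 15 runs pass) while pass^5 is only 33.3% (1 of 3 tasks passes every run), which mirrors the gap between the two columns in the leaderboard.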

MCP Server Experimental Configurations

Comparison of experimental MCP server configurations for GitHub Copilot with claude-opus-4-6.

| MCP Servers | mean (95% CI) | pass^5 | Avg Time | Ver |
|---|---|---|---|---|
| altool | 71.3% (70.1-74.1%) | 54.5% | 570.7s | 0.5.0 |
| None | 66.9% (64.6-69.7%) | 45.5% | 477.9s | 0.5.0 |
