## Test Generation
This category “reverses” the SWE-Bench workflow: instead of generating a fix, the agent generates a regression test that reproduces the reported issue. It evaluates test-driven development (TDD) ability: writing valid, executable AL test code that fails on the buggy codebase and passes once the fix is applied.
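The fail-to-pass rule above can be sketched as a small scoring function. This is an illustrative sketch only; the function names and the `test_outcomes` shape are assumptions, not the benchmark's actual harness API.

```python
def run_test(codebase: str, test_outcomes: dict) -> bool:
    """Stand-in for executing the generated AL test against a codebase
    snapshot; returns True if the test passes there. (Hypothetical.)"""
    return test_outcomes[codebase]

def score_generated_test(test_outcomes: dict) -> bool:
    """A generated test counts as a success only when it both reproduces
    the bug (fails on the buggy codebase) and validates the fix
    (passes on the fixed codebase)."""
    fails_on_buggy = not run_test("buggy", test_outcomes)
    passes_on_fixed = run_test("fixed", test_outcomes)
    return fails_on_buggy and passes_on_fixed

# A correct regression test: fails before the fix, passes after it.
print(score_generated_test({"buggy": False, "fixed": True}))   # True
# A vacuous test that passes everywhere does not count.
print(score_generated_test({"buggy": True, "fixed": True}))    # False
```

This two-sided check is what rules out trivial always-passing tests as well as tests that are simply broken on both snapshots.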
### Baseline Leaderboard
| Agent | Model | Mean (95% CI) | pass^5 | Avg Time | Version |
|---|---|---|---|---|---|
| GitHub Copilot | claude-opus-4-6 | 60.4% (58.6-62.0%) | 37.6% | 468.9s | 0.2.0 |
| GitHub Copilot | claude-opus-4-5 | 45.5% (43.4-48.9%) | 20.8% | 169.0s | 0.1.0 |
| GitHub Copilot | gpt-5-3-codex | 45.3% (42.6-48.1%) | 20.8% | 154.7s | 0.2.2 |
| GitHub Copilot | gpt-5-2-codex | 44.0% (40.8-48.5%) | 16.8% | 290.8s | 0.2.2 |
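The tables report a per-run mean alongside pass^5. Assuming pass^5 is the per-task probability that five independent runs all succeed, averaged over tasks and estimated unbiasedly as C(c, 5) / C(n, 5) from c successes in n runs (the benchmark's exact definition may differ), a minimal sketch:

```python
import math

def mean_and_pass_k(runs_per_task: list[list[bool]], k: int = 5):
    """runs_per_task[i] holds the boolean run outcomes for task i.
    Returns (mean pass rate, pass^k), where pass^k averages the
    unbiased per-task estimate C(c, k) / C(n, k) that k independently
    sampled runs all succeed. (Illustrative assumption, see lead-in.)"""
    mean = sum(sum(r) / len(r) for r in runs_per_task) / len(runs_per_task)

    def per_task(r: list[bool]) -> float:
        n, c = len(r), sum(r)
        # math.comb(c, k) is 0 when c < k, so partially solved tasks
        # with fewer than k successes contribute nothing.
        return math.comb(c, k) / math.comb(n, k) if n >= k else 0.0

    pass_k = sum(per_task(r) for r in runs_per_task) / len(runs_per_task)
    return mean, pass_k

# Two tasks, five runs each: one always solved, one solved 3/5 times.
runs = [[True] * 5, [True, True, False, True, False]]
print(mean_and_pass_k(runs))  # (0.8, 0.5)
```

This illustrates why pass^5 sits well below the mean in the tables: any task the agent does not solve consistently drags pass^5 toward zero faster than it drags the mean.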
### ALTest Custom Agent
Comparison of experimental configurations for the GitHub Copilot CLI with the ALTest custom agent, using claude-opus-4-6.
| Custom Agent | Mean (95% CI) | pass^5 | Avg Time | Version |
|---|---|---|---|---|
| ALTest | 62.8% (59.4-65.1%) | 39.6% | 533.1s | 0.3.0 |
| Default | 60.4% (58.6-62.0%) | 37.6% | 468.9s | 0.2.0 |