## Test Generation
This category “reverses” the SWE-Bench workflow: instead of generating a fix, the agent generates a regression test that reproduces the reported issue. It evaluates test-driven development (TDD) ability: writing valid, executable AL test code that fails on the buggy codebase and passes once the fix is applied.
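The fail-to-pass rule above can be sketched as a small scoring function. This is an illustrative sketch only; the function names and the `test_outcomes` shape are assumptions, not the benchmark's actual harness API.

```python
def run_test(codebase: str, test_outcomes: dict) -> bool:
    """Stand-in for executing the generated AL test against a codebase
    snapshot; returns True if the test passes there. (Hypothetical.)"""
    return test_outcomes[codebase]

def score_generated_test(test_outcomes: dict) -> bool:
    """A generated test counts as a success only when it both reproduces
    the bug (fails on the buggy codebase) and validates the fix
    (passes on the fixed codebase)."""
    fails_on_buggy = not run_test("buggy", test_outcomes)
    passes_on_fixed = run_test("fixed", test_outcomes)
    return fails_on_buggy and passes_on_fixed

# A correct regression test: fails before the fix, passes after it.
print(score_generated_test({"buggy": False, "fixed": True}))   # True
# A vacuous test that passes everywhere does not count.
print(score_generated_test({"buggy": True, "fixed": True}))    # False
```

This two-sided check is what rules out trivial always-passing tests as well as tests that are simply broken on both snapshots.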
### Baseline Leaderboard
| Agent | Model | Mean (95% CI) | pass^5 | Avg Time | Version |
|---|---|---|---|---|---|
| GitHub Copilot | claude-opus-4-6 | 60.4% (58.6-62.0%) | 37.6% | 468.9s | 0.2.0 |
| GitHub Copilot | claude-opus-4-5 | 45.5% (43.4-48.9%) | 20.8% | 169.0s | 0.1.0 |
| GitHub Copilot | gpt-5-3-codex | 45.3% (42.6-48.1%) | 20.8% | 154.7s | 0.2.2 |
| GitHub Copilot | gpt-5-2-codex | 44.0% (40.8-48.5%) | 16.8% | 290.8s | 0.2.2 |
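The tables report a per-run mean alongside pass^5. Assuming pass^5 is the per-task probability that five independent runs all succeed, averaged over tasks and estimated unbiasedly as C(c, 5) / C(n, 5) from c successes in n runs (the benchmark's exact definition may differ), a minimal sketch:

```python
import math

def mean_and_pass_k(runs_per_task: list[list[bool]], k: int = 5):
    """runs_per_task[i] holds the boolean run outcomes for task i.
    Returns (mean pass rate, pass^k), where pass^k averages the
    unbiased per-task estimate C(c, k) / C(n, k) that k independently
    sampled runs all succeed. (Illustrative assumption, see lead-in.)"""
    mean = sum(sum(r) / len(r) for r in runs_per_task) / len(runs_per_task)

    def per_task(r: list[bool]) -> float:
        n, c = len(r), sum(r)
        # math.comb(c, k) is 0 when c < k, so partially solved tasks
        # with fewer than k successes contribute nothing.
        return math.comb(c, k) / math.comb(n, k) if n >= k else 0.0

    pass_k = sum(per_task(r) for r in runs_per_task) / len(runs_per_task)
    return mean, pass_k

# Two tasks, five runs each: one always solved, one solved 3/5 times.
runs = [[True] * 5, [True, True, False, True, False]]
print(mean_and_pass_k(runs))  # (0.8, 0.5)
```

This illustrates why pass^5 sits well below the mean in the tables: any task the agent does not solve consistently drags pass^5 toward zero faster than it drags the mean.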
### ALTest Custom Agent
Comparison of experimental configurations for the GitHub Copilot CLI with the ALTest custom agent, using claude-opus-4-6.
| Custom Agent | Mean (95% CI) | pass^5 | Avg Time | Version |
|---|---|---|---|---|
| ALTest | 62.8% (59.4-65.1%) | 39.6% | 533.1s | 0.3.0 |
| Default | 60.4% (58.6-62.0%) | 37.6% | 468.9s | 0.2.0 |