
Test Generation

This category “reverses” the SWE-Bench workflow: instead of generating a fix, the agent generates a regression test that reproduces the issue. This evaluates Test-Driven Development (TDD) ability: writing valid, executable AL test code that fails on the buggy codebase and would pass once the bug is fixed.
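The acceptance criterion described above (compile, fail before the fix, pass after it) can be sketched as follows. This is a minimal illustration under assumed names; `TestRun` and `is_valid_regression_test` are hypothetical, not the benchmark's actual harness API.

```python
from dataclasses import dataclass

@dataclass
class TestRun:
    compiled: bool  # whether the generated AL test compiled in this tree
    passed: bool    # outcome of executing the test in this tree

def is_valid_regression_test(on_buggy: TestRun, on_fixed: TestRun) -> bool:
    """A generated test counts as a valid reproduction only if it
    compiles in both trees, fails on the buggy codebase, and passes
    once the fix is applied."""
    if not (on_buggy.compiled and on_fixed.compiled):
        return False
    return (not on_buggy.passed) and on_fixed.passed

# A test that fails pre-fix and passes post-fix is accepted:
accepted = is_valid_regression_test(TestRun(True, False), TestRun(True, True))
```

A test that passes in both trees (it never exercised the bug) or fails in both (it is simply broken) is rejected by the same check.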

Baseline Leaderboard

| Agent | Model | mean (95% CI) | pass^5 | Avg Time | Version |
|---|---|---|---|---|---|
| GitHub Copilot | claude-opus-4-6 | 60.4% (58.6-62.0%) | 37.6% | 468.9s | 0.2.0 |
| GitHub Copilot | claude-opus-4-5 | 45.5% (43.4-48.9%) | 20.8% | 169.0s | 0.1.0 |
| GitHub Copilot | gpt-5-3-codex | 45.3% (42.6-48.1%) | 20.8% | 154.7s | 0.2.2 |
| GitHub Copilot | gpt-5-2-codex | 44.0% (40.8-48.5%) | 16.8% | 290.8s | 0.2.2 |
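Assuming the pass^5 column follows the usual convention for pass^k in agent evaluations (the fraction of tasks solved in all five independent attempts, a stricter reliability measure than the mean), it can be computed from per-task outcomes like this; the function name and data layout are illustrative:

```python
def pass_pow_k(results: list[list[bool]], k: int = 5) -> float:
    """Fraction of tasks where all k sampled attempts succeed.
    `results[i]` holds the pass/fail outcomes for task i."""
    assert all(len(r) >= k for r in results), "need at least k attempts per task"
    solved_every_time = sum(all(r[:k]) for r in results)
    return solved_every_time / len(results)

# Two tasks: one solved in all five attempts, one flaky on attempt 2.
print(pass_pow_k([[True] * 5, [True, False, True, True, True]]))  # → 0.5
```

This is why pass^5 sits well below the mean in the tables here: a single flaky attempt on a task removes it from the pass^5 count.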

ALTest Custom Agent

Comparison of experimental configurations for the GitHub Copilot CLI, with and without the ALTest custom agent, using claude-opus-4-6.

| Custom Agent | mean (95% CI) | pass^5 | Avg Time | Version |
|---|---|---|---|---|
| ALTest | 62.8% (59.4-65.1%) | 39.6% | 533.1s | 0.3.0 |
| Default | 60.4% (58.6-62.0%) | 37.6% | 468.9s | 0.2.0 |
