Full per-category leaderboard for the current HealthAgentBench release, including the task-category breakdown behind the pooled success rate.

| # | Agent | Success Rate | X-ray | Tumor | CT | EHR DQ | Format Conv. | Event Model. | Trial Match | Cost/task | Time/task |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Codex (GPT 5.5) | | 33% | 40% | 33% | 25% | 100% | 72% | 52% | $2.8 | 14.7m |
| 2 | Copilot (Opus 4.8) | | 20% | 17% | 17% | 33% | 100% | 72% | 67% | $3.1 | 18.4m |
| 3 | Copilot (GPT 5.5) | | 27% | 13% | 23% | 42% | 100% | 67% | 44% | $2.6 | 17.8m |
| 4 | Claude Code (Opus 4.8) | | 17% | 20% | 17% | 33% | 100% | 67% | 48% | $4.0 | 16.4m |
| 5 | Codex (GPT 5.4) | | 40% | 7% | 30% | 13% | 100% | 61% | 19% | $1.3 | 17.6m |
| 6 | Claude Code (Opus 4.7) | | 13% | 10% | 13% | 33% | 100% | 78% | 26% | $4.8 | 16.4m |
| 7 | Codex (GPT 5.3) | | 20% | 17% | 10% | 13% | 100% | 61% | 19% | $1.0 | 13.6m |
| 8 | Claude Code (Opus 4.6) | | 10% | 17% | 0% | 42% | 0% | 33% | 22% | $4.1 | 21.6m |
| 9 | Claude Code (Sonnet 4.6) | | 10% | 10% | 3% | 13% | 100% | 61% | 11% | $2.9 | 23.5m |
| 10 | Codex (GPT 5.4 Mini) | | 10% | 0% | 23% | 4% | 100% | 39% | 19% | $0.6 | 15.7m |

Score metric: mean task success rate, where a task counts as resolved when it succeeds. Cost is the summed mean run cost for completing a task sweep over the benchmark's 54 tasks.
This page expands the homepage summary into per-category success rates across all 54 tasks.
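
As a rough illustration of how a pooled success rate can relate to per-category rates, the sketch below computes a task-count-weighted mean. It assumes that the pooled figure weights each category by its number of tasks, which the page does not confirm. The per-category task counts and the example rates are hypothetical: the counts were chosen only so that they sum to 54, and neither comes from the benchmark.

```python
from typing import Mapping


def pooled_success_rate(
    category_rates: Mapping[str, float],
    tasks_per_category: Mapping[str, int],
) -> float:
    """Task-count-weighted mean of per-category success rates (rates in [0, 1])."""
    if set(category_rates) != set(tasks_per_category):
        raise ValueError("rate and task-count categories must match")
    total_tasks = sum(tasks_per_category.values())
    if total_tasks == 0:
        raise ValueError("at least one task is required")
    # Expected number of solved tasks, summed over categories.
    solved = sum(
        category_rates[name] * tasks_per_category[name] for name in category_rates
    )
    return solved / total_tasks


# HYPOTHETICAL split of the 54 tasks; the source does not state these counts.
HYPOTHETICAL_TASK_COUNTS = {
    "X-ray": 10,
    "Tumor": 10,
    "CT": 10,
    "EHR DQ": 8,
    "Format Conv.": 1,
    "Event Model.": 6,
    "Trial Match": 9,
}

# HYPOTHETICAL agent; these rates are not a row from the leaderboard.
EXAMPLE_RATES = {
    "X-ray": 0.5,
    "Tumor": 0.2,
    "CT": 0.3,
    "EHR DQ": 0.25,
    "Format Conv.": 1.0,
    "Event Model.": 0.5,
    "Trial Match": 1 / 3,
}

if __name__ == "__main__":
    pooled = pooled_success_rate(EXAMPLE_RATES, HYPOTHETICAL_TASK_COUNTS)
    macro = sum(EXAMPLE_RATES.values()) / len(EXAMPLE_RATES)
    print(f"pooled (task-weighted): {pooled:.1%}")
    print(f"unweighted category mean: {macro:.1%}")
```

With these made-up inputs the agent solves 19 of 54 tasks, so the pooled rate is about 35%, while the unweighted mean of the seven category rates is higher because the single-task category counts as much as a ten-task one. This is why a pooled figure and a simple average of the category columns need not agree.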