Detailed Results

Full per-category leaderboard for the current HealthAgentBench release, including the task-category breakdown behind the pooled success rate.

# Agent Success Rate X-ray Tumor CT EHR DQ Format Conv. Event Model. Trial Match cost/task time/task
1 Codex (GPT 5.5) 42% 33% 40% 33% 25% 100% 72% 52% $2.8 14.7m
2 Copilot (Opus 4.8) 36% 20% 17% 17% 33% 100% 72% 67% $3.1 18.4m
3 Copilot (GPT 5.5) 35% 27% 13% 23% 42% 100% 67% 44% $2.6 17.8m
4 Claude Code (Opus 4.8) 32% 17% 20% 17% 33% 100% 67% 48% $4.0 16.4m
5 Codex (GPT 5.4) 28% 40% 7% 30% 13% 100% 61% 19% $1.3 17.6m
6 Claude Code (Opus 4.7) 27% 13% 10% 13% 33% 100% 78% 26% $4.8 16.4m
7 Codex (GPT 5.3) 22% 20% 17% 10% 13% 100% 61% 19% $1.0 13.6m
8 Claude Code (Opus 4.6) 19% 10% 17% 0% 42% 0% 33% 22% $4.1 21.6m
9 Claude Code (Sonnet 4.6) 17% 10% 10% 3% 13% 100% 61% 11% $2.9 23.5m
10 Codex (GPT 5.4 Mini) 16% 10% 0% 23% 4% 100% 39% 19% $0.6 15.7m

Score metric: Mean task success rate. Resolved = task success. Cost is the summed mean run cost for completing a task sweep over the benchmark's 54 tasks.
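As a rough illustration of how a pooled success rate relates to the per-category columns, the sketch below computes a task-weighted mean of category rates. The per-category task counts are not stated on this page; the values in `ASSUMED_TASK_COUNTS` are hypothetical, chosen only so that they sum to the benchmark's 54 tasks, and the function name `pooled_success_rate` is likewise an illustrative choice rather than part of any benchmark tooling.

```python
from typing import Sequence


def pooled_success_rate(rates: Sequence[float], task_counts: Sequence[int]) -> float:
    """Task-weighted mean of per-category success rates (fractions in [0, 1])."""
    if len(rates) != len(task_counts):
        raise ValueError("rates and task_counts must have the same length")
    total_tasks = sum(task_counts)
    if total_tasks <= 0:
        raise ValueError("task_counts must sum to a positive number")
    # Each category contributes in proportion to how many tasks it holds.
    return sum(r * n for r, n in zip(rates, task_counts)) / total_tasks


# ASSUMPTION: hypothetical task counts per category, in the table's column
# order. They are not published figures; they only need to sum to 54.
ASSUMED_TASK_COUNTS = [10, 10, 10, 8, 1, 6, 9]

# Per-category rates from the first table row (Codex, GPT 5.5).
codex_gpt55_rates = [0.33, 0.40, 0.33, 0.25, 1.00, 0.72, 0.52]

pooled = pooled_success_rate(codex_gpt55_rates, ASSUMED_TASK_COUNTS)
print(f"pooled success rate: {pooled:.0%}")
```

Under these assumed counts, the first row's category rates pool to roughly the 42% shown in its Success Rate column. A plain unweighted average of the seven category rates would generally differ, because categories with few tasks would then count as much as categories with many.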

This page expands the homepage summary into per-category success rates across all 54 tasks.