Results — HealthAgentBench

Detailed Results

Full per-category leaderboard for the current HealthAgentBench release, including the task-category breakdown behind the pooled success rate.


#	Agent	Success Rate	X-ray	Tumor	CT	EHR DQ	Format Conv.	Event Model.	Trial Match	Cost/task	Time/task
1	Codex (GPT 5.5)	42%	33%	40%	33%	25%	100%	72%	52%	$2.8	14.7m
2	Copilot (Opus 4.8)	36%	20%	17%	17%	33%	100%	72%	67%	$3.1	18.4m
3	Copilot (GPT 5.5)	35%	27%	13%	23%	42%	100%	67%	44%	$2.6	17.8m
4	Claude Code (Opus 4.8)	32%	17%	20%	17%	33%	100%	67%	48%	$4.0	16.4m
5	Codex (GPT 5.4)	28%	40%	7%	30%	13%	100%	61%	19%	$1.3	17.6m
6	Claude Code (Opus 4.7)	27%	13%	10%	13%	33%	100%	78%	26%	$4.8	16.4m
7	Codex (GPT 5.3)	22%	20%	17%	10%	13%	100%	61%	19%	$1.0	13.6m
8	Claude Code (Opus 4.6)	19%	10%	17%	0%	42%	0%	33%	22%	$4.1	21.6m
9	Claude Code (Sonnet 4.6)	17%	10%	10%	3%	13%	100%	61%	11%	$2.9	23.5m
10	Codex (GPT 5.4 Mini)	16%	10%	0%	23%	4%	100%	39%	19%	$0.6	15.7m

Score metric: mean task success rate, where a task counts as resolved when the agent succeeds on it. Cost is the summed mean run cost for completing a task sweep over the benchmark's 54 tasks.
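As a rough illustration of the metric described above, the sketch below computes a pooled mean task success rate, a per-category breakdown, and a summed mean run cost from a handful of run records. The record layout, the task IDs, the costs, and the `summarize` helper are all hypothetical and exist only for this example; the benchmark's real data format, its number of runs per task, and how the summed cost relates to the cost/task column are not stated on this page.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical run records: (task_id, category, succeeded, cost_usd).
# These are made-up values, not HealthAgentBench data.
RUNS = [
    ("xray-01", "X-ray", True, 2.0),
    ("xray-01", "X-ray", False, 3.0),
    ("xray-02", "X-ray", False, 1.0),
    ("fmt-01", "Format Conv.", True, 1.0),
    ("fmt-01", "Format Conv.", True, 1.0),
]


def summarize(runs):
    """Aggregate run records into pooled, per-category, and cost figures."""
    by_task = defaultdict(list)
    category_of = {}
    for task_id, category, succeeded, cost in runs:
        by_task[task_id].append((succeeded, cost))
        category_of[task_id] = category

    # Per-task success rate and mean run cost, averaged over that task's runs.
    task_rate = {
        t: mean(1.0 if ok else 0.0 for ok, _ in rs) for t, rs in by_task.items()
    }
    task_cost = {t: mean(c for _, c in rs) for t, rs in by_task.items()}

    # Group per-task rates by category for the breakdown columns.
    by_category = defaultdict(list)
    for t, rate in task_rate.items():
        by_category[category_of[t]].append(rate)

    return {
        # Pooled score: mean of per-task success rates across all tasks.
        "success_rate": mean(task_rate.values()),
        "per_category": {c: mean(rates) for c, rates in by_category.items()},
        # Summed mean run cost over one sweep of the tasks.
        "sweep_cost": sum(task_cost.values()),
    }


if __name__ == "__main__":
    print(summarize(RUNS))
```

Because the pooled score averages over tasks rather than over categories, a category with more tasks pulls the pooled figure harder than a small one. That is one reason a category column at 100% can sit next to a much lower overall success rate.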

This page expands the homepage summary into per-category success rates across all 54 tasks.