HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents

Leaderboard

Frontier agents ranked by pooled task success rate across 3 attempts × 54 tasks (162 trials per agent).

#	Agent	Success Rate	Cost/task
1	Codex (GPT 5.5)	42%	$2.8
2	Copilot (Opus 4.8)	36%	$3.1
3	Copilot (GPT 5.5)	35%	$2.6
4	Claude Code (Opus 4.8)	32%	$4.0
5	Codex (GPT 5.4)	28%	$1.3
6	Claude Code (Opus 4.7)	27%	$4.8
7	Codex (GPT 5.3)	22%	$1.0
8	Claude Code (Opus 4.6)	19%	$4.1
9	Claude Code (Sonnet 4.6)	17%	$2.9
10	Codex (GPT 5.4 Mini)	16%	$0.6

Score metric: Mean task success rate. Resolved = task success. Cost is the summed mean run cost for completing a task sweep over the benchmark's 54 tasks.
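
The pooled success rate used for ranking can be reproduced from per-trial outcomes. The sketch below is illustrative only: the function name `pooled_success_rate`, the dictionary layout, and the example outcomes are assumptions, not the benchmark's actual harness or data.

```python
from typing import Mapping, Sequence


def pooled_success_rate(outcomes: Mapping[str, Sequence[bool]]) -> float:
    """Fraction of successful trials over all (task, attempt) pairs."""
    # Flatten every attempt of every task into one list of trials.
    trials = [ok for attempts in outcomes.values() for ok in attempts]
    if not trials:
        raise ValueError("no trials recorded")
    return sum(trials) / len(trials)


# Hypothetical outcomes for 2 tasks x 3 attempts (made-up data).
example = {
    "task_a": [True, False, True],
    "task_b": [False, False, True],
}
rate = pooled_success_rate(example)  # 3 successes out of 6 trials
```

When every task has the same number of attempts, as with the 3 attempts per task here, the pooled rate equals the mean of the per-task success rates.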

Related-Work Positioning

This interactive version of the paper figure situates HealthAgentBench among prior healthcare-agent benchmarks. The horizontal axis tracks interaction realism and autonomy across three categories, the vertical axis is the number of input modalities a benchmark covers, and each marker's size encodes the number of clinical workflow stages.

Task Categories and Environments

The current release groups 54 tasks into 7 categories, each with its own environment. Each card below shows the category, modality, access regime, and best reported success rate.

X-ray Report Correction

Longitudinal 2D Imaging + Text | Multimodal report editing | 10 tasks

Review a counterfactual chest X-ray Findings draft against the target study and longitudinal history, then write the corrected report section. Success criteria: 0 errors in the corrected report.

PhysioNet credentialed (MIMIC-CXR) | 40% best success rate

Pathology Tumor Area Selection

Pathology WSI | Tile prediction | 10 tasks

Inspect a gigapixel pathology slide and predict the set of tumor tiles on a fixed analysis grid. Success criteria: ≥ 0.9 tile F1.

Public (CAMELYON16) | 40% best success rate
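
The tile F1 criterion for this category can be checked with plain set arithmetic. This is a minimal sketch under stated assumptions: tiles are represented as hypothetical `(row, col)` grid coordinates, the names `tile_f1` and `F1_THRESHOLD` are illustrative, and treating two empty sets as a perfect match is a choice made here, not something the benchmark specifies.

```python
from typing import Set, Tuple

Tile = Tuple[int, int]  # assumed (row, col) index on the fixed analysis grid
F1_THRESHOLD = 0.9      # success criterion stated for this category


def tile_f1(predicted: Set[Tile], reference: Set[Tile]) -> float:
    """Set-based F1 between predicted and reference tumor tiles."""
    if not predicted and not reference:
        return 1.0  # assumption: nothing to find and nothing predicted
    true_positives = len(predicted & reference)
    # F1 = 2*TP / (|predicted| + |reference|), equivalent to the harmonic
    # mean of precision and recall.
    return 2 * true_positives / (len(predicted) + len(reference))


# Made-up example: 3 of 4 reference tiles found, plus 1 false positive.
reference_tiles = {(0, 0), (0, 1), (1, 0), (1, 1)}
predicted_tiles = {(0, 0), (0, 1), (1, 0), (2, 2)}
score = tile_f1(predicted_tiles, reference_tiles)
passed = score >= F1_THRESHOLD
```

In this example the score is 0.75, so the trial would not count as a success under the 0.9 threshold.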

CT Abnormality Classification

3D imaging | Visual finding detection | 10 tasks

Interpret a chest CT volume and emit a per-volume yes/no label vector for the report-grounded abnormality findings. Success criteria: 1.0 accuracy.

HuggingFace gated (CT-RATE) | 33% best success rate

EHR Data Quality Auditing

Tabular EHR | Data quality auditing | 8 tasks

Audit a corrupted EHR corpus that contains errors and flag the offending rows. Success criteria: 1.0 recall and > 0.01 precision.

Public (MIMIC-IV demo + synthetic errors) | 42% best success rate
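
The two-part audit criterion (1.0 recall and > 0.01 precision) can be sketched as follows. Row identifiers as integers, the function name `audit_passes`, and the example data are assumptions for illustration; the benchmark's own scorer may differ. One plausible reading is that the precision floor rules out the trivial strategy of flagging every row.

```python
from typing import Set


def audit_passes(flagged: Set[int], corrupted: Set[int]) -> bool:
    """True if every corrupted row is flagged and precision exceeds 0.01."""
    if not flagged or not corrupted:
        return False  # assumption: empty inputs never count as a pass
    true_positives = len(flagged & corrupted)
    full_recall = true_positives == len(corrupted)
    # precision > 0.01 is TP / |flagged| > 1/100; integer form avoids
    # floating-point comparison issues.
    precise_enough = true_positives * 100 > len(flagged)
    return full_recall and precise_enough


corrupted_rows = {3, 7}                       # made-up ground truth
ok = audit_passes({3, 7, 9}, corrupted_rows)  # all found, one false alarm
flag_everything = audit_passes(set(range(1000)), corrupted_rows)
missed_one = audit_passes({3}, corrupted_rows)
```

Here `ok` is `True`, while `flag_everything` fails on precision (2 of 1000 flagged rows are correct) and `missed_one` fails on recall.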

EHR Format Conversion

EHR (MEDS) | Pipeline customization | 1 task

Customize a real MEDS ETL repository so it emits the required transformed cohort from raw demo EHR input. Success criteria: output exact match.

Public (MIMIC-IV demo) | 100% best success rate

EHR Event Modelling

Longitudinal EHR | Clinical event prediction | 6 tasks

Learn a prediction strategy from longitudinal event timelines and output held-out patient risk scores. Success criteria: match the human-engineered baseline.

Redivis-gated (Stanford STARR via EHRSHOT) | 78% best success rate

Clinical Trial Matching

Text | Ranking / retrieval | 9 tasks

Rank candidate clinical-trial IDs for a patient profile by eligibility confidence over a trial pool. Success criteria: 1.0 recall@50.

Public (TREC-CT 2021) | 67% best success rate
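
The recall@50 criterion for trial matching can be illustrated with a short sketch. The function name `recall_at_k`, the placeholder trial IDs, and the handling of an empty relevant set are assumptions made for this example, not details taken from the benchmark.

```python
from typing import Collection, Sequence


def recall_at_k(ranked_ids: Sequence[str],
                relevant_ids: Collection[str],
                k: int = 50) -> float:
    """Share of relevant trials that appear in the top-k of the ranking."""
    if not relevant_ids:
        raise ValueError("recall is undefined without relevant trials")
    top_k = set(ranked_ids[:k])
    hits = len(top_k & set(relevant_ids))
    return hits / len(set(relevant_ids))


# Made-up ranking of 100 placeholder IDs, best candidate first.
ranking = [f"trial_{i:03d}" for i in range(100)]
all_in_top_50 = recall_at_k(ranking, {"trial_004", "trial_049"})
one_outside = recall_at_k(ranking, {"trial_004", "trial_050"})
```

Under the stated criterion only a score of 1.0 counts as success, so the second case (0.5, because `trial_050` sits at rank 51) would fail.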

Open call

Add Your Healthcare Task

Contribution

Help expand HealthAgentBench with realistic clinical workflows, datasets, and evaluation protocols from your domain.

Citation

If you use HealthAgentBench, cite it with the following BibTeX entry.

@misc{liu2026healthagentbench,
  title = {HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents},
  author = {Liu, Qianchu and Zhang, Sheng and Qin, Guanghui and Valanarasu, Jeya Maria Jose and Rokuss, Maximilian and Lu, Mingyu and Ossowski, Timothy and Chaves, Juan Manuel Zambrano and Wong, Cliff and Argaw, Peniel and Hasija, Yashna and Wei, Mu and Yim, Wen-wai and Liu, Qin and Jing, Zilin and Entenmann, Jason and Usuyama, Naoto and Naumann, Tristan and Poon, Hoifung},
  year = {2026},
  url = {https://arxiv.org/abs/2606.31179}
}
