54 tasks across 7 categories

HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents

HealthAgentBench evaluates frontier agents on realistic healthcare workflows spanning seven task categories with distinct environments and diverse data modalities, including 2D chest X-rays, 3D CT volumes, gigapixel whole-slide pathology images, free-text clinical documents, and structured EHR data. Each task is executed in a terminal environment using real clinical artefacts with minimal instructions, and evaluated against hidden gold labels to determine task success.

54 tasks | 7 environments | 7 categories | 10 frontier agents | 42% best success rate

Leaderboard

Frontier agents ranked by pooled task success rate over 3 attempts × 54 tasks (162 trials per agent).

| # | Agent | Success rate | Cost/task |
|---|---|---|---|
| 1 | Codex (GPT 5.5) | 42% | $2.8 |
| 2 | Copilot (Opus 4.8) | 36% | $3.1 |
| 3 | Copilot (GPT 5.5) | 35% | $2.6 |
| 4 | Claude Code (Opus 4.8) | 32% | $4.0 |
| 5 | Codex (GPT 5.4) | 28% | $1.3 |
| 6 | Claude Code (Opus 4.7) | 27% | $4.8 |
| 7 | Codex (GPT 5.3) | 22% | $1.0 |
| 8 | Claude Code (Opus 4.6) | 19% | $4.1 |
| 9 | Claude Code (Sonnet 4.6) | 17% | $2.9 |
| 10 | Codex (GPT 5.4 Mini) | 16% | $0.6 |

Score metric: Mean task success rate. Resolved = task success. Cost is the summed mean run cost for completing a task sweep over the benchmark's 54 tasks.
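
As a rough illustration of the pooled metric, the sketch below counts successes over every attempt of every task. The function name `pooled_success_rate` and the dictionary layout are assumptions made for this example, not the benchmark's actual scoring code.

```python
from typing import Dict, List


def pooled_success_rate(outcomes: Dict[str, List[bool]]) -> float:
    """Fraction of successful trials over all attempts of all tasks."""
    trials = [ok for attempts in outcomes.values() for ok in attempts]
    if not trials:
        raise ValueError("no trials recorded")
    return sum(trials) / len(trials)


# Toy data (hypothetical): 2 tasks x 3 attempts = 6 trials, 2 successes.
example = {
    "task_a": [True, False, True],
    "task_b": [False, False, False],
}
print(f"{pooled_success_rate(example):.0%}")  # 33%
```

When every task has the same number of attempts, as with 3 attempts per task here, the pooled rate equals the mean of the per-task success rates.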

Related-Work Positioning

This interactive version of the paper figure situates HealthAgentBench among prior healthcare-agent benchmarks. The horizontal axis tracks interaction realism and autonomy across three categories, the vertical axis is the number of input modalities a benchmark covers, and each marker's size encodes the number of clinical workflow stages.

Performance by Task Categories (Environments)

Each column is a task category with its unique environment. Each cell encodes three signals at once: color is the per-category success rate, circle size is average wall-clock time, and the green $ tier is the mean run cost.

[Figure: heatmap of agents (rows) by task category (columns); per-cell values are not recoverable from the extracted text.]
Columns: X-ray Report Corr., CT Abnorm. Classif., Path. Tumor Area Sel., EHR Quality Audit, Clinical Trial Match., EHR Event Model., EHR Format Conv.
Column groups: Medical Imaging (multimodal), Database Retrieval (text), EHR modeling.
Rows: Codex (GPT 5.5), Codex (GPT 5.4), Codex (GPT 5.3), Codex (GPT 5.4 Mini), Copilot (Opus 4.8), Copilot (GPT 5.5), Claude Code (Opus 4.8), Claude Code (Opus 4.7), Claude Code (Opus 4.6), Claude Code (Sonnet 4.6).
Legend: task success rate, 0% to 100% (higher is better); average wall-clock time, 2m to 60m (smaller = faster); run cost tier (USD): $ < $1, $$ $1–2, $$$ $2–3, $$$$ > $3.

Task Categories and Environments

The current release groups 54 tasks into 7 categories, each with its own unique environment. Each card below shows the category, modality, access regime, and best reported success rate.

X-ray Report Correction

Longitudinal 2D Imaging + Text | Multimodal report editing | 10 tasks

Review a counterfactual chest X-ray Findings draft against the target study and longitudinal history, then write the corrected report section. Task success criteria: 0 errors in the corrected report.

PhysioNet credentialed (MIMIC-CXR) | 40% best success rate

Pathology Tumor Area Selection

Pathology WSI | Tile prediction | 10 tasks

Inspect a gigapixel pathology slide and predict the set of tumor tiles on a fixed analysis grid. Success criteria: ≥ 0.9 tile F1.

Public (CAMELYON16) | 40% best success rate
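
The tile F1 criterion for this category can be sketched as a set comparison between predicted and gold tumor tiles. This is a minimal illustration only: the `(row, col)` tile encoding, the function name `tile_f1`, and the handling of empty sets are assumptions, not the benchmark's evaluator.

```python
from typing import Set, Tuple

Tile = Tuple[int, int]  # hypothetical (row, col) index on the analysis grid


def tile_f1(predicted: Set[Tile], gold: Set[Tile]) -> float:
    """F1 between the predicted and gold tumor-tile sets."""
    if not predicted and not gold:
        return 1.0  # assumed convention: nothing to find, nothing predicted
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {(0, 0), (0, 1), (1, 0), (1, 1)}
pred = {(0, 0), (0, 1), (1, 0), (2, 2)}
score = tile_f1(pred, gold)
print(score, score >= 0.9)  # 0.75 False
```

In this toy case one missed tile and one spurious tile out of four already drop F1 to 0.75, below the 0.9 threshold.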

CT Abnormality Classification

3D imaging | Visual finding detection | 10 tasks

Interpret a chest CT volume and emit a per-volume yes/no label vector for the report-grounded abnormality findings. Success criteria: 1.0 accuracy.

HuggingFace gated (CT-RATE) | 33% best success rate
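
Requiring 1.0 accuracy means every entry of the yes/no label vector must match the gold labels. The sketch below shows one way such a check could look; the function name `label_accuracy` and the boolean-list encoding are assumptions for illustration.

```python
from typing import Sequence


def label_accuracy(predicted: Sequence[bool], gold: Sequence[bool]) -> float:
    """Share of findings whose yes/no label matches the gold label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("label vectors must be non-empty and equal length")
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


# Toy vectors (hypothetical): one of five findings is mislabeled.
gold = [True, False, False, True, False]
pred = [True, False, True, True, False]
acc = label_accuracy(pred, gold)
print(acc, acc == 1.0)  # 0.8 False
```

A single wrong label is enough to fail the task under this criterion.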

EHR Data Quality Auditing

Tabular EHR | Data quality auditing | 8 tasks

Audit a corrupted EHR corpus that contains errors and flag the offending rows. Success criteria: 1.0 recall and > 0.01 precision.

Public (MIMIC-IV demo + synthetic errors) | 42% best success rate
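
The two-part criterion can be read as "catch every corrupted row without simply flagging everything". A minimal sketch under assumed conventions (integer row IDs, the hypothetical function name `audit_passes`) follows.

```python
from typing import Set


def audit_passes(flagged: Set[int], corrupted: Set[int]) -> bool:
    """True if recall is 1.0 and precision exceeds 0.01."""
    if not corrupted:
        raise ValueError("expected at least one corrupted row")
    hits = len(flagged & corrupted)
    recall = hits / len(corrupted)
    precision = hits / len(flagged) if flagged else 0.0
    return recall == 1.0 and precision > 0.01


corrupted = {3, 17}  # hypothetical row IDs of injected errors
print(audit_passes({3, 17, 42}, corrupted))       # True
print(audit_passes({3}, corrupted))               # False: a corrupted row is missed
print(audit_passes(set(range(1000)), corrupted))  # False: precision is 0.002
```

The last call shows why the precision floor matters: flagging all 1000 rows reaches full recall but is rejected.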

EHR Format Conversion

EHR (MEDS) | Pipeline customization | 1 task

Customize a real MEDS ETL repository so it emits the required transformed cohort from raw demo EHR input. Success criteria: output exact match.

Public (MIMIC-IV demo) | 100% best success rate

EHR Event Modelling

Longitudinal EHR | Clinical event prediction | 6 tasks

Learn a prediction strategy from longitudinal event timelines and output held-out patient risk scores. Success criteria: match the human-engineered baseline.

Redivis-gated (Stanford STARR via EHRSHOT) | 78% best success rate

Clinical Trial Matching

Text | Ranking / retrieval | 9 tasks

Rank candidate clinical-trial IDs for a patient profile by eligibility confidence over a trial pool. Success criteria: 1.0 recall@50.

Public (TREC-CT 2021) | 67% best success rate
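
Reading "1.0 recall@50" as "every eligible trial appears among the top 50 ranked IDs", the check can be sketched as below. The function name `recall_at_k` and the short trial IDs are made up for this example.

```python
from typing import List, Set


def recall_at_k(ranked_ids: List[str], eligible: Set[str], k: int = 50) -> float:
    """Share of eligible trials that appear in the top-k of the ranking."""
    if not eligible:
        raise ValueError("expected at least one eligible trial")
    top_k = set(ranked_ids[:k])
    return len(top_k & eligible) / len(eligible)


ranking = ["T3", "T1", "T7", "T2"]  # hypothetical IDs, best match first
eligible = {"T1", "T2"}
print(recall_at_k(ranking, eligible, k=3))  # 0.5
print(recall_at_k(ranking, eligible))       # 1.0
```

Because only membership in the top k counts, the order of trials inside the top 50 does not affect this criterion.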

Open call

Add Your Healthcare Task

Contribution

Help expand HealthAgentBench with realistic clinical workflows, datasets, and evaluation protocols from your domain.

Community submissions | Contribute task proposal

Citation

If you use HealthAgentBench, cite it with the following BibTeX entry.

@misc{liu2026healthagentbench,
  title = {HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents},
  author = {Liu, Qianchu and Zhang, Sheng and Qin, Guanghui and Valanarasu, Jeya Maria Jose and Rokuss, Maximilian and Lu, Mingyu and Ossowski, Timothy and Chaves, Juan Manuel Zambrano and Wong, Cliff and Argaw, Peniel and Hasija, Yashna and Wei, Mu and Yim, Wen-wai and Liu, Qin and Jing, Zilin and Entenmann, Jason and Usuyama, Naoto and Naumann, Tristan and Poon, Hoifung},
  year = {2026},
  url = {https://arxiv.org/abs/2606.31179}
}