HealthAgentBench
Modalities (5): text, EHR, 2D, 3D, WSI
Workflow stages (4): diagnosis, treatment, data, research
HealthAgentBench evaluates frontier agents on realistic healthcare workflows spanning seven task categories with distinct environments and diverse data modalities, including 2D chest X-rays, 3D CT volumes, gigapixel whole-slide pathology images, free-text clinical documents, and structured EHR data. Each task is executed in a terminal environment using real clinical artefacts with minimal instructions, and evaluated against hidden gold labels to determine task success.
Frontier agents are ranked by pooled task success rate across 3 attempts × 54 tasks (162 trials).
| # | Agent | Success Rate | Cost/task |
|---|---|---|---|
| 1 | Codex (GPT 5.5) | | $2.8 |
| 2 | Copilot (Opus 4.8) | | $3.1 |
| 3 | Copilot (GPT 5.5) | | $2.6 |
| 4 | Claude Code (Opus 4.8) | | $4.0 |
| 5 | Codex (GPT 5.4) | | $1.3 |
| 6 | Claude Code (Opus 4.7) | | $4.8 |
| 7 | Codex (GPT 5.3) | | $1.0 |
| 8 | Claude Code (Opus 4.6) | | $4.1 |
| 9 | Claude Code (Sonnet 4.6) | | $2.9 |
| 10 | Codex (GPT 5.4 Mini) | | $0.6 |
Score metric: Mean task success rate. Resolved = task success. Cost is the summed mean run cost for completing a task sweep over the benchmark's 54 tasks.
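The pooled success rate can be illustrated with a minimal sketch. It assumes each task has a list of boolean outcomes, one per attempt; the function name and the toy data are hypothetical and are not taken from the benchmark's code.

```python
from typing import Dict, List


def pooled_success_rate(outcomes: Dict[str, List[bool]]) -> float:
    """Fraction of successful trials, pooled over all tasks and attempts."""
    trials = [ok for attempts in outcomes.values() for ok in attempts]
    if not trials:
        raise ValueError("no trials recorded")
    return sum(trials) / len(trials)


# Hypothetical toy data: 2 tasks x 3 attempts.
# The benchmark itself pools 54 tasks x 3 attempts = 162 trials.
toy = {
    "task_a": [True, False, True],
    "task_b": [False, False, True],
}
rate = pooled_success_rate(toy)
print(f"{rate:.3f}")  # 3 successes out of 6 trials
```

Pooling treats every trial equally, so with the same number of attempts per task it coincides with the mean of the per-task success rates.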
This interactive version of the paper figure situates HealthAgentBench among prior healthcare-agent benchmarks. The horizontal axis tracks interaction realism and autonomy across three categories, the vertical axis is the number of input modalities a benchmark covers, and each marker's size encodes the number of clinical workflow stages.
Each column is a task category with its unique environment. Each cell encodes three signals at once: color is the per-category success rate, circle size is average wall-clock time, and the green $ tier is the mean run cost.
The current release groups 54 tasks into 7 categories, each with its own environment. Each card below shows the category, modality, access regime, and best reported success rate.
Review a counterfactual chest X-ray Findings draft against the target study and longitudinal history, then write the corrected report section. Success criteria: 0 errors in the corrected report.
Inspect a gigapixel pathology slide and predict the set of tumor tiles on a fixed analysis grid. Success criteria: ≥ 0.9 tile F1.
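As a rough illustration of the tile F1 criterion, the sketch below scores a predicted tile set against a gold tile set. Representing tiles as `(row, col)` grid coordinates, the helper names, and the handling of empty sets are assumptions made for this example; the benchmark's actual scorer may differ.

```python
from typing import Set, Tuple

Tile = Tuple[int, int]  # assumed (row, col) index on the fixed analysis grid


def tile_f1(predicted: Set[Tile], gold: Set[Tile]) -> float:
    """F1 between predicted and gold tumor-tile sets."""
    if not predicted and not gold:
        return 1.0  # assumption: two empty sets count as a perfect match
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def passes(predicted: Set[Tile], gold: Set[Tile], threshold: float = 0.9) -> bool:
    """Apply the >= 0.9 tile F1 success criterion."""
    return tile_f1(predicted, gold) >= threshold


# Hypothetical toy example: 3 of 4 gold tiles found, plus 1 false positive.
gold = {(0, 0), (0, 1), (1, 0), (1, 1)}
pred = {(0, 0), (0, 1), (1, 0), (2, 2)}
print(tile_f1(pred, gold), passes(pred, gold))
```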
Interpret a chest CT volume and emit a per-volume yes/no label vector for the report-grounded abnormality findings. Success criteria: 1.0 accuracy.
Audit a corrupted EHR corpus that contains errors and flag the offending rows. Success criteria: 1.0 recall and > 0.01 precision.
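The combined recall and precision criterion can be sketched as follows, assuming flagged and corrupted rows are identified by integer row IDs. The function name and toy data are hypothetical. One plausible reading, not stated in the source, is that the precision floor rules out the trivial strategy of flagging every row.

```python
from typing import Set


def audit_passes(flagged: Set[int], corrupted: Set[int],
                 min_precision: float = 0.01) -> bool:
    """Check recall == 1.0 and precision > min_precision for flagged rows."""
    tp = len(flagged & corrupted)
    # Assumption: with no corrupted rows, recall is treated as 1.0;
    # with no flagged rows, precision is treated as 0.0.
    recall = tp / len(corrupted) if corrupted else 1.0
    precision = tp / len(flagged) if flagged else 0.0
    return recall == 1.0 and precision > min_precision


# Hypothetical toy corpus of 1000 rows with 2 corrupted rows.
corrupted = {3, 17}
print(audit_passes({3, 17, 42}, corrupted))       # all errors found, few extras
print(audit_passes({3}, corrupted))               # one corrupted row missed
print(audit_passes(set(range(1000)), corrupted))  # everything flagged
```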
Customize a real MEDS ETL repository so it emits the required transformed cohort from raw demo EHR input. Success criteria: output exact match.
Learn a prediction strategy from longitudinal event timelines and output held-out patient risk scores. Success criteria: match the human-engineered baseline.
Rank candidate clinical-trial IDs for a patient profile by eligibility confidence over a trial pool. Success criteria: 1.0 recall@50.
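For readers unfamiliar with recall@50, here is a minimal sketch. It assumes the agent submits a ranked list of trial IDs and that the gold label is the set of eligible trials; the IDs below are made-up placeholders, and the function name is hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 50) -> float:
    """Fraction of relevant trial IDs that appear in the top-k of the ranking."""
    if not relevant:
        raise ValueError("relevant set must be non-empty")
    top_k = set(ranked[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical pool of 100 placeholder trial IDs, already ranked by the agent.
ranked = [f"NCT{i:08d}" for i in range(100)]

inside = {ranked[0], ranked[49]}   # both eligible trials within the top 50
outside = {ranked[0], ranked[50]}  # one eligible trial ranked 51st

print(recall_at_k(ranked, inside))
print(recall_at_k(ranked, outside))
```

Under the 1.0 recall@50 criterion, only the first ranking would count as a success, since every eligible trial must appear among the top 50 candidates.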
Help expand HealthAgentBench with realistic clinical workflows, datasets, and evaluation protocols from your domain.
If you use HealthAgentBench, cite it with the following BibTeX entry.
@misc{liu2026healthagentbench,
title = {HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents},
author = {Liu, Qianchu and Zhang, Sheng and Qin, Guanghui and Valanarasu, Jeya Maria Jose and Rokuss, Maximilian and Lu, Mingyu and Ossowski, Timothy and Chaves, Juan Manuel Zambrano and Wong, Cliff and Argaw, Peniel and Hasija, Yashna and Wei, Mu and Yim, Wen-wai and Liu, Qin and Jing, Zilin and Entenmann, Jason and Usuyama, Naoto and Naumann, Tristan and Poon, Hoifung},
year = {2026},
url = {https://arxiv.org/abs/2606.31179}
}