EHR-Complex: Benchmarking Medical Agents for Complex Clinical Reasoning

2026-06-22

Artificial Intelligence
AI summary

The authors created EHR-Complex, a large test of how well AI programs can understand and work with messy, real-life electronic health records (EHRs). Unlike simpler tests, the benchmark contains many complicated questions that can only be answered by writing and running SQL queries or Python code against a database. They found that current AI models struggle with these tasks, often making mistakes in database logic, medical code lookups, and understanding what a question means. This shows that even the best systems today have trouble doing detailed and accurate medical data analysis. EHR-Complex aims to help improve AI tools for practical healthcare data work.

Electronic Health Records (EHRs), MIMIC-IV, SQL, Longitudinal data, Clinical database reasoning, Large language models (LLMs), Database query execution, Medical code lookup, Compositional reasoning, Exact-match accuracy
Authors
Yitong Qiao, Lei Liu, Yue Shen, Jian Wang, Jinjie Gu, Zhixuan Chu, Kui Ren
Abstract
Clinical agents promise to democratize access to electronic health records (EHRs), yet existing benchmarks fail to reflect the complexity of practical EHR analysis: they often operate on idealized, clean EHRs and evaluate static SQL generation rather than interactive execution. In this work, we introduce EHR-Complex, a large-scale benchmark for interactive clinical database reasoning. Built on MIMIC-IV (365K patients, 31 tables, 500M+ records), EHR-Complex comprises about 52K tasks spanning six clinical intents and supports both patient-level and population-level queries. Each task requires an agent to interact with a sandboxed environment by executing SQL queries or Python code. Notably, EHR-Complex reflects real-world SQL task complexity, including longitudinal multi-table aggregation and compositional reasoning, with an average of 31.93 SQL structural components per query. Evaluation on EHR-Complex reveals the difficulty of these EHR reasoning scenarios: the top-performing model achieves only 62.3% exact-match accuracy, and Pass^k consistency drops below 50% for nearly all evaluated models at k=4, exposing broad stochastic fragility. A fine-grained analysis of more than 3,800 failed trajectories from representative LLMs reveals three dominant failure modes: SQL logic errors, medical-code lookup failures, and semantic misunderstandings. EHR-Complex provides a rigorous testbed for clinical agents and highlights remaining gaps in robust reasoning for large-scale EHR analysis.
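
The abstract reports Pass^k consistency without defining it. The sketch below assumes the definition commonly used for agent benchmarks: for each task, the probability that k trials drawn from n recorded trials all succeed, estimated as C(c, k) / C(n, k) where c is the number of exact-match successes, then averaged over tasks. The paper may define the metric differently. The function names and the per-task success counts are hypothetical and are not taken from EHR-Complex.

```python
from math import comb


def pass_hat_k(num_trials: int, num_correct: int, k: int) -> float:
    """Estimate, for one task, the chance that k sampled trials all succeed.

    Assumed definition: C(num_correct, k) / C(num_trials, k).
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    # math.comb returns 0 when k > num_correct, so tasks with fewer than
    # k successes contribute 0 to the average.
    return comb(num_correct, k) / comb(num_trials, k)


def benchmark_pass_hat_k(correct_counts: list[int], num_trials: int, k: int) -> float:
    """Average the per-task estimate over all tasks."""
    per_task = [pass_hat_k(num_trials, c, k) for c in correct_counts]
    return sum(per_task) / len(per_task)


# Hypothetical data: three tasks, four trials each, with 4, 3, and 0
# exact-match successes respectively.
counts = [4, 3, 0]
score_k1 = benchmark_pass_hat_k(counts, num_trials=4, k=1)
score_k4 = benchmark_pass_hat_k(counts, num_trials=4, k=4)
print(f"pass^1 = {score_k1:.3f}, pass^4 = {score_k4:.3f}")
```

Under this assumed definition, k=1 reduces to average per-trial accuracy, while larger k rewards only tasks that are solved on every trial. In the toy data, the task solved three times out of four counts as 0.75 at k=1 but as 0 at k=4, which is the kind of drop the abstract describes as stochastic fragility.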