AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks

2026-05-11 | Artificial Intelligence

AI summary

The authors looked at how well large language model (LLM) agents work for predicting health risks using different types of medical data like records, images, and reports. They tested both single agents and groups of agents working together on big, real-world datasets. They found that single agents perform better and manage mixed data types more effectively than simple multi-agent setups. The authors suggest more work is needed to improve cooperation between multiple agents. They also shared their tools publicly to help future research.

large language models, clinical decision support, multimodal data, electronic health records, medical imaging, multi-agent systems, risk prediction, model calibration, collaborative agents
Authors
Baraa Al Jorf, Farah E. Shamout
Abstract
Building effective clinical decision support systems requires the synthesis of complex, heterogeneous multimodal data. Such modalities include temporal electronic health record data, medical images, radiology reports, and clinical notes. Large language model (LLM)-based agents have shown impressive performance in various healthcare tasks, especially those involving textual modalities. Considering the fragmentation of healthcare data across hospital systems, collaborative agent frameworks present a promising direction to mitigate data sharing challenges. However, the effectiveness of LLM agents for multimodal clinical risk prediction remains largely unexamined. In this work, we conduct a systematic evaluation of LLM-based agents for clinical prediction tasks using large-scale real-world data. We assess performance in unimodal and multimodal settings and quantify performance gaps between single-agent and multi-agent systems. Our findings highlight that single-agent frameworks outperform naive multi-agent systems, are better at handling multimodal data, and are better calibrated. This underscores a critical need for improving multi-agent collaboration to better handle heterogeneous inputs. By open-sourcing our code and evaluation framework, this work offers a new benchmark to support future developments relating to agentic systems in healthcare.
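The abstract reports that single-agent frameworks are better calibrated but does not say which calibration metric was used. As an illustration only, the sketch below computes expected calibration error (ECE) with equal-width bins, one common way to quantify calibration for binary risk predictions. The function name, the bin count, and the choice of ECE itself are assumptions made for this example and are not taken from the paper or its released code.

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Illustrative ECE for binary predictions (hypothetical helper).

    probs  -- predicted probabilities of the positive class, each in [0, 1]
    labels -- ground-truth outcomes, each 0 or 1
    n_bins -- number of equal-width probability bins (assumed default)
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if len(probs) == 0:
        raise ValueError("at least one prediction is required")

    # Group predictions into equal-width bins over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    # Weighted average of |mean predicted risk - observed event rate| per bin.
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        event_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_prob - event_rate)
    return ece
```

A lower value indicates that predicted risks track observed event rates more closely. Under this sketch, comparing a single-agent and a multi-agent system would mean computing the score on the same held-out patients for each system's predicted probabilities.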