Physiology-Aware CNN and Zero-Shot Multimodal LLMs for ECG Image Classification: A Comparative Study
2026-06-22 • Machine Learning
AI summary
The authors tested whether multimodal language models, built to understand text and images in general, can tell normal from abnormal ECG images without any task-specific training. They found that these models performed about as well as random guessing (ROC-AUC around 0.5). In contrast, specialized neural networks designed with heart-related knowledge discriminated well, with ROC-AUC of 0.92-0.94 on the internal test set and 0.85-0.86 on the external PTB-XL dataset. This suggests that tailored medical AI models interpret ECGs better than general-purpose language models, so accurate ECG analysis still requires domain-focused AI.
ECG, 12-lead ECG, multimodal large language models, zero-shot learning, CNN, ROC-AUC, PTB-XL dataset, waveform morphology, lead groups, binary classification
Authors
Khalil Ahammad, Derek Abbott, Mohsen Dorraki
Abstract
Multimodal large language models (LLMs) are increasingly used to interpret 12-lead ECG images, though their interpretations often lack validation. ECG image understanding differs substantially from general image understanding because it depends on precise waveform morphology, lead relationships, and accurate interval measurements. This study investigated whether zero-shot multimodal LLMs can reliably distinguish normal from abnormal ECG images and, in parallel, evaluated CNN-based models as clinically grounded references. Standard 12-lead ECG recordings were rendered as single-page images for a binary normal-abnormal classification task. Three prominent LLMs (GPT-5.2, GPT-4.1, and Gemini-2.5 Pro) were tested using a fixed zero-shot prompt across multiple runs. In parallel, a physiology-aware CNN-based model, LeadGroupECG, was developed to aggregate features from predefined anatomical lead groups. The model was compared with ResNet18, DenseNet121, and VGG16 baselines, and all models were evaluated on an internal test set and the external PTB-XL dataset. Across seeds, the CNN-based models demonstrated stable discrimination, with average internal ROC-AUC of 0.92-0.94 and external ROC-AUC of 0.85-0.86. The proposed LeadGroupECG model significantly improved over its backbone internally without compromising external generalization. It remained competitive with the other baselines while consistently highlighting anatomical lead-group contributions. In contrast, zero-shot LLM discrimination remained near chance (ROC-AUC around 0.5). PR-AUC improved slightly when ECGs were rendered on a grid-based calibration background rather than grid-free. Although multimodal LLMs can generate reasonable ECG narratives, their zero-shot diagnostic discrimination remains limited. Therefore, clinically framed, domain-specific architectures remain essential for AI-based ECG interpretation.
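To make the headline metric concrete, the following minimal sketch computes ROC-AUC as the probability that a randomly chosen abnormal ECG receives a higher score than a randomly chosen normal one. It is an illustration only and not the authors' evaluation code: the function name `roc_auc` and the toy labels and scores are hypothetical, chosen to show what good discrimination and chance-level discrimination look like numerically.

```python
def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison (Mann-Whitney U statistic).

    labels: 1 = abnormal ECG, 0 = normal ECG.
    scores: model confidence that the ECG is abnormal.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one abnormal and one normal example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (NOT results from the study).
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# A scorer that mostly ranks abnormal ECGs above normal ones.
discriminative_scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]

# A scorer that gives every ECG the same confidence, so it cannot rank them.
uninformative_scores = [0.5] * 8

auc_good = roc_auc(labels, discriminative_scores)
auc_chance = roc_auc(labels, uninformative_scores)
print(auc_good, auc_chance)
```

On this toy data the first scorer misorders only one of the sixteen abnormal-normal pairs, while the second ties on every pair and lands at exactly 0.5, which is the "near chance" regime the abstract reports for zero-shot LLMs. The quadratic pairwise loop is kept for readability; a rank-based implementation would be preferable for datasets of realistic size.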