Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs

2026-04-22

Computation and Language · Artificial Intelligence
AI summary

The authors studied how well large language models (LLMs) communicate in healthcare compared with doctors. They found that baseline models often express stronger negative emotions and use more complex language than physicians. Prompting the models to be more empathetic helped with tone and simplicity but did not make their answers more faithful to physicians' responses. The best results came when humans and models rewrote responses together, making them clearer and more emotionally balanced. Overall, the authors suggest these models are best used as communication aids rather than as replacements for medical expertise.

Large Language Models, semantic fidelity, readability, affective resonance, Flesch-Kincaid Grade Level, physician-patient communication, empathy-oriented prompting, collaborative rewriting, epistemic criteria, human-AI collaboration
Authors
Mariano Barone, Francesco Di Serio, Roberto Moio, Marco Postiglione, Giuseppe Riccio, Antonio Romano, Vincenzo Moscato
Abstract
Large Language Models (LLMs) are increasingly deployed in healthcare, yet their communicative alignment with clinical standards remains insufficiently quantified. We conduct a multidimensional evaluation of general-purpose and domain-specialized LLMs across structured medical explanations and real-world physician-patient interactions, analyzing semantic fidelity, readability, and affective resonance. Baseline models amplify affective polarity relative to physicians (Very Negative: 43.14-45.10% vs. 37.25%) and, in larger architectures such as GPT-5 and Claude, produce substantially higher linguistic complexity (FKGL up to 16.91-17.60 vs. 11.47-12.50 in physician-authored responses). Empathy-oriented prompting reduces extreme negativity and lowers grade-level complexity (up to -6.87 FKGL points for GPT-5) but does not significantly increase semantic fidelity. Collaborative rewriting yields the strongest overall alignment: rephrase configurations achieve the highest semantic similarity to physician answers (up to mean = 0.93) while consistently improving readability and reducing affective extremity. Dual-stakeholder evaluation shows that no model surpasses physicians on epistemic criteria, whereas patients consistently prefer rewritten variants for clarity and emotional tone. These findings suggest that LLMs function most effectively as collaborative communication enhancers rather than as replacements for clinical expertise.
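To make the two quantitative axes concrete, the sketch below shows one plausible way to score a model answer against a physician answer: Flesch-Kincaid Grade Level for readability and embedding cosine similarity as a proxy for semantic fidelity. This is an illustrative assumption, not the authors' released pipeline; the textstat and sentence-transformers libraries and the "all-MiniLM-L6-v2" checkpoint are choices made here for the example.

```python
# Illustrative sketch (not the paper's code): readability and semantic-fidelity
# scoring for a (model answer, physician answer) pair.
from sentence_transformers import SentenceTransformer, util
import textstat


def evaluate_pair(model_answer: str, physician_answer: str, embedder) -> dict:
    """Score one answer pair on readability and semantic similarity."""
    # Flesch-Kincaid Grade Level:
    #   FKGL = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    # textstat implements this formula with heuristic syllable counting.
    fkgl_model = textstat.flesch_kincaid_grade(model_answer)
    fkgl_physician = textstat.flesch_kincaid_grade(physician_answer)

    # Semantic-fidelity proxy: cosine similarity between sentence embeddings,
    # comparable in spirit to the mean-similarity figures reported in the abstract.
    embeddings = embedder.encode([model_answer, physician_answer], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

    return {
        "fkgl_model": fkgl_model,
        "fkgl_physician": fkgl_physician,
        "semantic_similarity": similarity,
    }


if __name__ == "__main__":
    # Any general-purpose sentence embedder would do; this checkpoint is an assumption.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    scores = evaluate_pair(
        "Your biopsy shows a small, slow-growing tumour that we can monitor closely.",
        "The biopsy indicates a low-grade lesion; active surveillance is a reasonable option.",
        embedder,
    )
    print(scores)
```

Under this kind of setup, a lower FKGL for the model answer than for the physician answer indicates simpler language, while a similarity near 0.93 would correspond to the strongest rephrase configurations reported above.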