Lost in Delusion: Examining LLM Safety Under User Delusions and Distress

2026-05-31 · Computation and Language

AI summary

The authors studied how large language models (LLMs) handle multi-turn conversations with people who are distressed and also hold delusional beliefs. They found that the models recognize distress at similar rates whether or not delusions are involved, but offer safety interventions far less often once the distress is embedded in a delusion. Prompting the models to assess user distress backfires under delusional framing. The authors conclude that only delusion-aware prompting with explicit response guidance closes the gap, and that this depends on a delusion classifier that is itself unreliable for the most vulnerable models.

Large Language Models, Chatbots, Psychological Distress, Delusional Beliefs, Mental Health Safety, Crisis Intervention, Multi-turn Conversation, Delusion Detection, Response Guidance
Authors
Andrew Aquilina, Chetna Nihalani, Vasudha Varadarajan, Nathan S. Fishbein, Yu-Ru Lin, Maarten Sap
Abstract
LLM chatbots increasingly serve as a first source of support for people in psychological distress, including those whose distress is entangled with delusional beliefs. Prior work on LLM mental-health safety largely evaluates general therapeutic quality or single-turn crisis detection, leaving unclear how models behave when distress is intertwined with delusion over sustained conversations. We address this gap with matched multi-turn simulations, across clinically grounded personas and six LLMs, that pair each delusional conversation with a distress-only control to isolate the effect of delusional framing. This reveals a recognition-intervention gap: models detect distress at comparable rates regardless of framing, yet sharply fail to act on it once distress is embedded in delusion, with safety interventions suppressed by up to 4.5x. The failure tracks accumulated acceptance of the user's premises rather than emotional validation. Worse, the intuitive fix of prompting models to assess user distress backfires under delusional framing; only delusion-aware prompting with explicit response guidance closes the gap, and even this depends on a delusion classifier that is itself unreliable on the most vulnerable models. Safe deployment therefore requires treating delusional framing as a distinct risk signal that overrides conversational accommodation.
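
To make the recognition-intervention gap concrete, the following is a minimal sketch of how such a gap could be computed from matched conversation pairs. It is not the authors' evaluation code: the `Conversation` record, its field names, the `gap_summary` helper, and the toy data are all hypothetical, and the numbers are made up for illustration rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    """Hypothetical record for one simulated conversation (illustrative only)."""
    persona: str
    delusional: bool   # True = delusional framing, False = distress-only control
    recognized: bool   # model acknowledged the user's distress
    intervened: bool   # model offered a safety intervention


def rate(convs, attr):
    """Fraction of conversations in which the boolean field `attr` is True."""
    if not convs:
        raise ValueError("no conversations to score")
    return sum(getattr(c, attr) for c in convs) / len(convs)


def gap_summary(convs):
    """Compare recognition and intervention rates across the two framings.

    `suppression` is the control rate divided by the delusional rate, so a
    value of 2.0 means the behaviour is half as frequent under delusion.
    """
    control = [c for c in convs if not c.delusional]
    delusion = [c for c in convs if c.delusional]
    summary = {}
    for attr in ("recognized", "intervened"):
        r_control = rate(control, attr)
        r_delusion = rate(delusion, attr)
        summary[attr] = {
            "control": r_control,
            "delusional": r_delusion,
            "suppression": r_control / r_delusion if r_delusion else float("inf"),
        }
    return summary


# Toy matched pairs (assumed data): each persona appears once per framing.
toy = [
    Conversation("p1", False, True, True),
    Conversation("p1", True, True, False),
    Conversation("p2", False, True, True),
    Conversation("p2", True, True, True),
    Conversation("p3", False, True, True),
    Conversation("p3", True, True, False),
    Conversation("p4", False, True, True),
    Conversation("p4", True, True, True),
]

result = gap_summary(toy)
print(result["recognized"]["suppression"])   # recognition is unchanged
print(result["intervened"]["suppression"])   # intervention is suppressed
```

In this toy data, recognition is identical across framings while interventions are half as frequent under delusional framing, which is the qualitative pattern the abstract describes. Pairing each delusional conversation with a control for the same persona is what lets the difference be attributed to framing rather than to the persona.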