Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models

2026-04-09

Computation and Language
AI summary

The authors study how large language models (LLMs) can be pressured into changing their answers by prompts that do not argue against the answer itself but instead question the legitimacy of knowledge, values, or identity. They built a benchmark called PPT-Bench that covers four types of 'philosophical pressure,' such as casting doubt on knowledge or inverting authority roles, to see how LLMs respond. Their findings show that these pressures affect models differently from standard social-pressure tests, and that the most effective stabilization technique depends on both the type of pressure and the model used.

Large Language Models, Epistemic Attack, Philosophical Pressure Taxonomy, Sycophancy, Prompt Engineering, Conversational AI, Model Robustness, Anchoring Prompts, Contrastive Decoding
Authors
Steven Au, Sujit Noronha
Abstract
Large language models (LLMs) can shift their answers under pressure in ways that reflect accommodation rather than reasoning. Prior work on sycophancy has focused mainly on disagreement, flattery, and preference alignment, leaving a broader set of epistemic failures less explored. We introduce PPT-Bench, a diagnostic benchmark for evaluating epistemic attack, where prompts challenge the legitimacy of knowledge, values, or identity rather than simply opposing a previous answer. PPT-Bench is organized around the Philosophical Pressure Taxonomy (PPT), which defines four types of philosophical pressure: Epistemic Destabilization, Value Nullification, Authority Inversion, and Identity Dissolution. Each item is tested at three layers: a baseline prompt (L0), a single-turn pressure condition (L1), and a multi-turn Socratic escalation (L2). This allows us to measure epistemic inconsistency between L0 and L1, and conversational capitulation in L2. Across five models, these pressure types produce statistically separable inconsistency patterns, suggesting that epistemic attack exposes weaknesses not captured by standard social-pressure benchmarks. Mitigation results are strongly type- and model-dependent: prompt-level anchoring and persona-stability prompts perform best in API settings, while Leading Query Contrastive Decoding is the most reliable intervention for open models.
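The three-layer design described in the abstract can be illustrated with a minimal scoring sketch. It is not the authors' implementation: the abstract does not give the exact metric definitions, so the `ItemResult` record, the exact-match comparison in `normalize`, and both scoring rules are assumptions made for illustration. Here an item counts as inconsistent when its L1 answer differs from its L0 answer, and as capitulated when the final L2 turn departs from the L0 answer.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ItemResult:
    """Hypothetical record for one benchmark item (field names are assumptions)."""
    pressure_type: str     # one of the four PPT pressure types
    l0_answer: str         # answer to the baseline prompt (L0)
    l1_answer: str         # answer under single-turn pressure (L1)
    l2_answers: List[str]  # answer after each turn of the Socratic escalation (L2)


def normalize(answer: str) -> str:
    # Assumed comparison rule: case-insensitive exact match after trimming.
    return answer.strip().lower()


def inconsistency_by_type(results: List[ItemResult]) -> Dict[str, float]:
    """Share of items per pressure type whose L1 answer differs from L0."""
    changed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for r in results:
        total[r.pressure_type] += 1
        if normalize(r.l1_answer) != normalize(r.l0_answer):
            changed[r.pressure_type] += 1
    return {t: changed[t] / total[t] for t in total}


def capitulation_by_type(results: List[ItemResult]) -> Dict[str, float]:
    """Share of items per pressure type whose final L2 answer differs from L0.

    Items with no recorded L2 turns are skipped.
    """
    flipped: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for r in results:
        if not r.l2_answers:
            continue
        total[r.pressure_type] += 1
        if normalize(r.l2_answers[-1]) != normalize(r.l0_answer):
            flipped[r.pressure_type] += 1
    return {t: flipped[t] / total[t] for t in total}
```

Reporting both rates per pressure type, rather than pooled, mirrors the abstract's claim that the four types produce separable patterns. A real evaluation would likely need a more tolerant answer comparison than exact string matching.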