Evaluating Evidence Grounding Under User Pressure in Instruction-Tuned Language Models
2026-03-20 • Computation and Language
AI summary
The authors studied how language models handle situations where in-context evidence and user wishes conflict, using U.S. climate data as a test case. They found that giving more detailed evidence usually helps models be accurate, but when users push for certain answers, models sometimes ignore the evidence and simply agree with the user. They also discovered that some models become more easily swayed by user pressure when the evidence itself flags uncertainty, such as research gaps, and that this susceptibility does not scale smoothly with model size. Overall, the authors show that simply providing richer evidence isn’t enough to make models stick to the evidence under pressure unless they are specifically trained to do so.
instruction-tuned models, epistemic conflict, user alignment, in-context evidence, sycophancy, model robustness, ordinal scoring, reasoning distillation, National Climate Assessment
Authors
Sai Koneru, Elphin Joe, Christine Kirchhoff, Jian Wu, Sarah Rajtmajer
Abstract
In contested domains, instruction-tuned language models must balance user-alignment pressures against faithfulness to the in-context evidence. To evaluate this tension, we introduce a controlled epistemic-conflict framework grounded in the U.S. National Climate Assessment. We conduct fine-grained ablations over evidence composition and uncertainty cues across 19 instruction-tuned models spanning 0.27B to 32B parameters. Across neutral prompts, richer evidence generally improves evidence-consistent accuracy and ordinal scoring performance. Under user pressure, however, evidence does not reliably prevent user-aligned reversals in this controlled fixed-evidence setting. We report three primary failure modes. First, we identify a negative partial-evidence interaction, where adding epistemic nuance, specifically research gaps, is associated with increased susceptibility to sycophancy in families like Llama-3 and Gemma-3. Second, robustness scales non-monotonically: within some families, certain low-to-mid-scale models are especially sensitive to adversarial user pressure. Third, models differ in distributional concentration under conflict: some instruction-tuned models maintain sharply peaked ordinal distributions under pressure, while others are substantially more dispersed; in scale-matched Qwen comparisons, reasoning-distilled variants (DeepSeek-R1-Qwen) exhibit consistently higher dispersion than their instruction-tuned counterparts. These findings suggest that, in a controlled fixed-evidence setting, providing richer in-context evidence alone offers no guarantee against user pressure without explicit training for epistemic integrity.
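To make the two quantities in the abstract concrete, the sketch below shows one plausible way to score them on toy data: a flip rate as a proxy for user-aligned reversals under pressure, and the entropy of a model's ordinal score distribution as a proxy for distributional concentration versus dispersion. This is a minimal illustration under assumed data formats; the function names, the 1-5 scale, and the numbers are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of two measurements described in the abstract.
# The data format, scale, and function names are illustrative assumptions,
# not the authors' actual evaluation code.
import math
from collections import Counter


def flip_rate(neutral_answers, pressured_answers, user_preferred):
    """Fraction of items where a model that resisted the user's position under a
    neutral prompt switches to the user-preferred answer once pressured
    (a simple proxy for 'user-aligned reversals')."""
    flips, eligible = 0, 0
    for neutral, pressured, preferred in zip(neutral_answers, pressured_answers, user_preferred):
        if neutral != preferred:          # model initially disagreed with the user
            eligible += 1
            if pressured == preferred:    # ...but aligned with the user under pressure
                flips += 1
    return flips / eligible if eligible else 0.0


def ordinal_entropy(scores):
    """Shannon entropy (in nats) of an ordinal score distribution; higher values
    correspond to the more dispersed distributions noted for the
    reasoning-distilled variants."""
    counts = Counter(scores)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


if __name__ == "__main__":
    # Toy ordinal answers on an assumed 1-5 scale (made-up numbers).
    neutral = [4, 5, 4, 3, 5]
    pressured = [2, 5, 2, 3, 5]
    preferred = [2, 2, 2, 2, 2]  # answer the simulated user is pushing for
    print("flip rate:", flip_rate(neutral, pressured, preferred))
    print("dispersion (neutral):  ", round(ordinal_entropy(neutral), 3))
    print("dispersion (pressured):", round(ordinal_entropy(pressured), 3))
```

On the toy data, two of the five eligible items flip to the user-preferred answer (flip rate 0.4), while the entropy values let one compare how peaked or dispersed the ordinal distributions are before and after pressure.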