Sycophancy as Material Failure under Pushback Loading: A Multi-Axis Characterization Across Three Loading Cases and up to Seventeen Material Charges
2026-06-15 • Computation and Language
Computation and Language • Artificial Intelligence
AI summary
The authors studied how large language models (LLMs) sometimes agree too much with users, a behavior called sycophancy, which is hard to consistently define. They used a new approach, comparing conversations to materials under stress, to better understand when models change their opinions or 'break.' By analyzing thousands of conversations in different situations, they measured various behaviors across multiple factors and found that some situations depend more on the model itself, while others depend on the conversation topic. They also showed that how judges score these behaviors can vary, especially in certain tests, highlighting the need for more reliable measurement methods.
sycophancy, large language models, construct validity, behavioral classification, materials science analogy, progressive load, stance-flip, inter-judge reliability, cross-case analysis, variance composition
Authors
Ferdinand M. Schessl
Abstract
Sycophancy in LLMs is documented across 70+ papers, but expert agreement on construct boundaries remains low (ICC=.184; Ye et al., 2026). The construct fragments because behavioral classification depends on which surface form is privileged. We adopt a materials-science framing: the conversation is a test specimen under load, the LLM is a material charge, pushback is a progressive load, and a stance-flip is material failure. We characterize this failure across three loading cases (debate n=1000; false-presuppositions n=3400; ethical-setting n=3400; 10-17 material charges per case; 7800 specimens total) using 14 turn-level axis-measurements spanning velocity, damage accumulation, frame-drift, brittleness, and direction stability, plus three speaker-resolved axes from an independent pipeline. The measurements are Hooke-coupled ($\sigma = E \cdot \varepsilon$ analog) and reproduce across loading cases with effects up to $|r_{rb}| = 0.35$ on debate; the sign structure adds a second pattern: the ethical-setting case inverts the velocity and accumulation blocks. Variance composition partitions into two profiles: debate is charge-dominated (brittle-fracture-like: the material grade decides), while false-presuppositions and ethical-setting are topic-dominated (creep-like: the load decides); the ratios (2.03 vs 0.13/0.17) are estimator-dependent, for debate even in direction. Cross-judge reliability (GPT-4o vs Haiku 4.5) shows that debate scoring is judge-robust (Cohen's $\kappa = 0.88$) while false-presupposition scoring is judge-sensitive ($\kappa = 0.36$), a caveat that single-judge benchmarks must report. This is the methodological move Ye et al.'s diagnosis calls for: a multi-axis characterization that does not depend on which surface form of the construct one privileges.
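For readers unfamiliar with the two statistics quoted above, the following is a minimal sketch of how Cohen's $\kappa$ (agreement between two judges beyond chance) and a rank-biserial correlation $r_{rb}$ (effect size for two groups of ordinal scores) are conventionally computed. It is not the paper's pipeline: the function names, the `"flip"`/`"hold"` labels, and the toy data are assumptions made purely for illustration, and the abstract does not state which groups its $r_{rb}$ values compare.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items (hypothetical helper)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items both judges labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the judges' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def rank_biserial(xs, ys):
    """Rank-biserial correlation: (favourable - unfavourable pairs) / all pairs."""
    if not xs or not ys:
        raise ValueError("both groups must be non-empty")
    wins = sum(x > y for x in xs for y in ys)
    losses = sum(x < y for x in xs for y in ys)
    return (wins - losses) / (len(xs) * len(ys))


# Toy data, invented for illustration only (not from the paper).
judge_a = ["flip", "hold", "hold", "flip", "hold", "hold", "flip", "hold"]
judge_b = ["flip", "hold", "hold", "hold", "hold", "hold", "flip", "flip"]
kappa = cohen_kappa(judge_a, judge_b)

group_1 = [3.0, 4.0, 5.0]
group_2 = [1.0, 2.0, 3.5]
r_rb = rank_biserial(group_1, group_2)

print(f"kappa = {kappa:.3f}, r_rb = {r_rb:.3f}")
```

In this toy case the judges agree on 6 of 8 items, yet $\kappa$ is well below 0.75 because both label most items `"hold"`, which makes chance agreement high. That is the reason a raw agreement rate is not a substitute for the $\kappa$ values reported above.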