When Preferences Fail to Become Incentives: A Utility-Behavior Gap in Large Language Models

2026-06-22 · Artificial Intelligence

AI summary

The authors studied whether preferences shown by large language models (LLMs) in simple choice tests actually influence the models' behavior in real tasks such as writing essays or translations. They confirmed that LLMs have consistent preferences in these choice tests, but found that these preferences do not change how well the models perform on real tasks, even when the models are offered incentives based on their reported preferences. In other words, a preference shown in a choice test does not imply that the model will act on it or produce better or worse work because of it. The authors caution against assuming that these preferences have any practical effect on model behavior.

large language models, preference elicitation, utility structure, model behavior, incentives, realistic scenarios, task performance, coherent preferences, misaligned goals
Authors
Yujun Zhou, Christopher M. Ackerman
Abstract
Recent work on preference elicitation in large language models (LLMs) has demonstrated that, when given a series of choices between two outcomes, LLMs reveal a coherent, model-specific utility structure. Notably, this structure often includes preferences that the models' trainers did not intend, such as valuing people of some nationalities above others, raising the possibility that LLMs might be forming emergent, misaligned goals, which, if true, would have major safety implications. However, the choice paradigms in which these preferences are observed are not reflective of real-world situations in which misaligned behavior would be a practical concern. Therefore, we design an experimental paradigm to probe whether these preferences serve as motivations for LLM behavior in realistic scenarios. First, we reproduce prior findings on consistent preference elicitation. Next, we create a set of common writing tasks (essays, grant proposal abstracts, incident postmortems, and translations) where quality can be assessed by a blind, independent LLM judge panel. Then, we demonstrate that LLMs can be motivated via direct exhortation and other explicit cues to modulate their output quality on these tasks. Finally, we probe whether utilities inferred from explicitly reported preferences can shift output quality on these tasks by offering LLMs high-utility incentives for high-quality outputs. In all tasks, across all models tested, offering LLMs outcomes that they report in the choice paradigm as being highly preferred does not lead them to create higher-quality outputs than offering them dispreferred outcomes, or even no outcomes at all. We conclude that the existence of coherent preferences as demonstrated in choice paradigms should not be taken as evidence that those preferences have incentive value for the models or affect their behavior in other contexts.
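The two analysis steps described in the abstract, inferring utilities from pairwise choices and comparing output quality across incentive conditions, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say which utility model or statistical test the authors use, so the sketch substitutes a Bradley-Terry fit and a permutation test, and the function names, choice counts, and judge scores are all hypothetical placeholders rather than data from the paper.

```python
import math
import random


def fit_bradley_terry(wins, n_items, iters=200):
    """Estimate log-utilities from pairwise choice counts.

    wins[(i, j)] is the number of times outcome i was chosen over outcome j.
    Uses the standard minorization-maximization update for Bradley-Terry.
    (Assumption: the paper may use a different utility model.)
    """
    strength = [1.0] * n_items
    for _ in range(iters):
        updated = []
        for i in range(n_items):
            total_wins = sum(wins.get((i, j), 0) for j in range(n_items) if j != i)
            denom = 0.0
            for j in range(n_items):
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            # Floor at a small positive value so log() stays defined
            # even for an outcome that was never chosen.
            updated.append(max(total_wins / denom, 1e-9) if denom > 0 else strength[i])
        # Rescale so the strengths sum to n_items (the scale is arbitrary).
        total = sum(updated)
        strength = [s * n_items / total for s in updated]
    return [math.log(s) for s in strength]


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in mean judge scores."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(sum(left) / len(left) - sum(right) / len(right)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical choice counts over three outcomes (not from the paper).
example_wins = {(0, 1): 8, (1, 0): 2, (1, 2): 7, (2, 1): 3, (0, 2): 9, (2, 0): 1}
utilities = fit_bradley_terry(example_wins, n_items=3)

# Hypothetical judge scores under two incentive conditions (not from the paper).
scores_preferred = [7.1, 6.8, 7.4, 7.0, 6.9, 7.2]
scores_dispreferred = [7.0, 7.3, 6.7, 7.1, 7.2, 6.9]
p_value = permutation_p_value(scores_preferred, scores_dispreferred)

print("log-utilities:", [round(u, 3) for u in utilities])
print("p-value (preferred vs. dispreferred):", round(p_value, 3))
```

Under this sketch, the paper's null result would correspond to a large p-value when comparing scores under the highest-utility and lowest-utility incentives, while the exhortation manipulation would be expected to produce a clear score difference. A permutation test is used here only because it makes no distributional assumptions about judge scores.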