AI summary
The authors studied whether a language model internally estimates how likely its current strategy is to reach its goals, a quantity they call "value". Using synthetic in-context reinforcement learning data, they identified a direction in the activations of Qwen3-8B that reflects this value. Activations along it separate confident from uncertain responses, rollouts with backtracking from those without, and correct from corrupted code. Steering the model toward higher value suppresses self-correction and wordiness, while steering toward lower value induces backtracking and exploration. They also found that training changes the internal value: direct preference optimization raises the value of rewarded behaviors, supervised fine-tuning raises confidence within the training domain, and after post-training the model assigns low value to politically sensitive queries. Overall, the authors suggest that language models keep a simple, linear estimate of how well their current strategy is likely to work.
language models, value estimation, in-context learning, reinforcement learning, model activations, self-correction, direct preference optimization, fine-tuning, model confidence, backtracking
Authors
Nick Jiang, Isaac Kauvar, Jack Lindsey
Abstract
We investigate whether language models internally track the value of their current trajectory, defined as the likelihood that their ongoing strategy will achieve their goals. Using synthetic, in-context reinforcement learning data, we construct a "value" axis for Qwen3-8B. We find that activations along this axis distinguish between high vs. low verbalized confidence, rollouts without vs. with backtracking, and correct vs. corrupted code. Steering towards high value causally suppresses self-correction and reduces explanatory verbosity, while steering towards low value induces backtracking and exploration. We demonstrate that direct preference optimization (DPO) can increase the internal value of rewarded behaviors (e.g., using a certain word), causing the model to act more confidently after exhibiting them. Finally, we apply the value axis to study in-the-wild settings. For example, we find that Qwen assigns low value to politically sensitive chat queries after post-training and that supervised fine-tuning increases internal confidence within the training domain. Our results suggest that language models linearly encode an estimate of expected goal success that modulates their confidence in pursuing a direction.
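The abstract does not say how the value axis is fit or how steering is applied, so the following is only an illustrative sketch of what "a linear axis in activation space" can mean. It assumes a difference-of-means direction between activations labeled high-value and low-value, which is a common choice for linear directions but is not confirmed here as the authors' method. The function names (`value_axis`, `project`, `steer`) are hypothetical, and the data are random Gaussian stand-ins, not activations from Qwen3-8B.

```python
import numpy as np

def value_axis(high_acts, low_acts):
    """Unit-norm difference-of-means direction (assumed construction)."""
    direction = high_acts.mean(axis=0) - low_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def project(acts, axis):
    """Scalar readout: component of each activation vector along the axis."""
    return acts @ axis

def steer(acts, axis, alpha):
    """Shift activations along the axis; alpha > 0 moves toward 'high value'."""
    return acts + alpha * axis

# Toy stand-in data (NOT real model activations): two Gaussian clusters
# separated along the first coordinate of a 16-dimensional space.
rng = np.random.default_rng(0)
d = 16
true_dir = np.zeros(d)
true_dir[0] = 1.0
high = rng.normal(size=(200, d)) + 2.0 * true_dir
low = rng.normal(size=(200, d)) - 2.0 * true_dir

axis = value_axis(high, low)
high_scores = project(high, axis)
low_scores = project(low, axis)

# Steering the low-value cluster raises its projection by exactly alpha,
# because the axis has unit norm.
steered = steer(low, axis, alpha=3.0)
```

In a real model, `high` and `low` would be hidden states collected at a chosen layer, and `steer` would be applied to that layer's output during generation; which layer, token positions, and scale to use are choices this sketch leaves open.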