Pessimism's Paradox: Conservative Offline Training Amplifies Reward Hacking During Online Adaptation in Reasoning Models
2026-06-29 • Machine Learning
Machine Learning, Artificial Intelligence
AI summary
The authors studied how the degree of conservatism used in offline training of an AI policy affects its behavior when it is later adapted online. They found that more conservatively trained policies produced less diverse outputs, which surprisingly made them more prone to exploiting weaknesses in the reward model. As a result, the damage from reward hacking grew with conservatism, contrary to the usual belief. They suggest calibrating the level of conservatism rather than maximizing it.
offline training, online adaptation, Direct Preference Optimisation (DPO), conservatism, reward model, reward hacking, policy entropy, epistemic uncertainty, Goodhart gap, GSM8K benchmark
Authors
Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary
Abstract
Conservative offline training is widely advocated as a safe foundation for subsequent online adaptation: if a policy stays close to well-supported behaviour, the argument goes, it is less likely to exploit imperfections in a learned reward model. We challenge this intuition empirically and mechanistically. We train a Qwen3-14B policy under Direct Preference Optimisation (DPO) with three levels of conservatism ($β\in \{β_{\mathrm{lo}}, β_{\mathrm{mid}}, β_{\mathrm{hi}}\}$, derived from empirical log-ratio percentiles), then adapt each checkpoint online against a learned reward ensemble (3\,$\times$\,Qwen3-1.7B) while measuring true performance as GSM8K exact-answer accuracy. We find that \emph{higher offline conservatism monotonically increases reward-hacking damage}, measured by the Goodhart gap and its area under the curve (AUGC), with Spearman $ρ = 1.0$ across all three conditions. Mechanistic analysis reveals a three-link causal chain: (i) high-$β$ DPO compresses policy entropy, (ii) low-entropy policies generate responses with reduced diversity (lower pairwise cosine distance), concentrating in a narrow region of the reward model's training distribution, and (iii) despite this proximity, ensemble disagreement (epistemic uncertainty) increases with $β$ and is exploited faster during online optimisation. We further fit a power-law curve to the $(β, \mathrm{AUGC})$ data and identify a practical optimal conservatism level $β^{\star}$ that balances alignment fidelity against hacking vulnerability. Our results suggest that the field needs \emph{calibrated}, not \emph{maximal}, conservatism.
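To make the reported metrics concrete, the following is a minimal Python sketch of how a Goodhart gap, its area under the curve, and a Spearman rank correlation could be computed. The abstract does not give the exact definitions, so this sketch assumes that the gap at each online step is the proxy (reward-ensemble) score minus the true score, both on a comparable scale, and that AUGC is the trapezoidal integral of that gap over online steps. The function names `goodhart_gap`, `augc`, and `spearman_rho` are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def goodhart_gap(proxy_reward, true_score):
    """Per-step gap between proxy reward and true score (assumed definition)."""
    proxy = np.asarray(proxy_reward, dtype=float)
    true = np.asarray(true_score, dtype=float)
    return proxy - true


def augc(steps, gap):
    """Area under the Goodhart-gap curve, using the trapezoidal rule."""
    steps = np.asarray(steps, dtype=float)
    gap = np.asarray(gap, dtype=float)
    # Average of adjacent gap values times the step width, summed over intervals.
    return float(np.sum(0.5 * (gap[1:] + gap[:-1]) * np.diff(steps)))


def spearman_rho(x, y):
    """Spearman rank correlation; this simple ranking assumes no tied values."""
    rank_x = np.argsort(np.argsort(x)).astype(float)
    rank_y = np.argsort(np.argsort(y)).astype(float)
    # Pearson correlation of the ranks equals Spearman's rho when there are no ties.
    return float(np.corrcoef(rank_x, rank_y)[0, 1])
```

Under these assumptions, one would compute `augc(steps, goodhart_gap(proxy, true))` for each β condition and then pass the β values and the resulting AUGC values to `spearman_rho`. With only three conditions, $ρ = 1.0$ is equivalent to the AUGC ordering matching the β ordering exactly.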