A Unifying Lens on Reward Uncertainty in RLHF
2026-06-08 • Machine Learning
Machine Learning, Artificial Intelligence, Computation and Language
AI summary
The authors address reward hacking in reinforcement learning from human feedback, where the policy exploits errors in the learned reward model instead of genuinely improving its outputs. They propose a distributional reward model, which represents uncertainty explicitly, unlike a standard scalar reward model. Using Bayesian inference and KL-distributionally robust optimization, they derive a closed-form effective reward that unifies earlier ways of handling reward-model uncertainty. Previously used ensemble aggregation rules turn out to be limits or truncations of this formula, which clarifies the implicit assumptions behind each rule.
Reinforcement Learning, Human Feedback, Reward Hacking, Reward Model, Uncertainty, Distributional Model, Bayesian Inference, KL Divergence, Robust Optimization, Ensemble Methods
Authors
Ely Hahami, Yoel Zimmermann, Ray Zhou, Jack Benarroch Jedlicki
Abstract
Reinforcement learning from human feedback (RLHF) is bottlenecked by \emph{reward hacking}, where the policy exploits errors in a proxy reward model (RM) and produces high RM scores without genuine quality gains. A natural mitigation is \emph{pessimism}: penalizing rewards in regions where the RM is uncertain. However, standard scalar RMs provide no principled notion of uncertainty. We argue that the right object is a \emph{distributional} reward model $p(r\mid x,y)$. Under either a Bayesian inference or a KL-distributionally robust optimization (KL-DRO) lens, the KL-regularized RLHF objective admits a closed-form effective reward $\tilde r(x,y) = \pm\beta\log\mathbb{E}_p[e^{\pm r/\beta}]$. The pessimistic branch unifies the prior heuristics for RM ensemble aggregation: mean aggregation, worst-case optimization (WCO), and uncertainty-weighted optimization (UWO) all emerge as limits or truncations of this single expression. This also clarifies the implicit assumptions of each existing rule.
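The pessimistic branch of the effective reward can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes $p(r\mid x,y)$ is approximated by a uniform distribution over a small set of ensemble reward samples, and the names `effective_reward`, `mean_and_variance`, and `ensemble` are hypothetical. The mapping of each heuristic to a regime is our reading of the formula rather than a statement from the abstract: large $\beta$ appears to recover mean aggregation, small $\beta$ approaches the ensemble minimum (WCO-like), and a second-order cumulant expansion gives a mean-minus-variance rule, $\mu - \sigma^2/(2\beta)$ (UWO-like).

```python
import math


def effective_reward(rewards, beta, pessimistic=True):
    """Return sign * beta * log(mean(exp(sign * r / beta))).

    sign = -1 gives the pessimistic (soft-min) branch,
    sign = +1 gives the optimistic (soft-max) branch.
    Assumes the reward distribution is uniform over `rewards`.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    sign = -1.0 if pessimistic else 1.0
    scaled = [sign * r / beta for r in rewards]
    # Shift by the maximum before exponentiating (log-sum-exp trick)
    # so that small beta does not overflow.
    shift = max(scaled)
    mean_exp = sum(math.exp(s - shift) for s in scaled) / len(scaled)
    return sign * beta * (shift + math.log(mean_exp))


def mean_and_variance(rewards):
    """Population mean and variance of the reward samples."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    return mu, var


# Hypothetical scores from a three-member reward-model ensemble.
ensemble = [1.0, 2.0, 4.0]
mu, var = mean_and_variance(ensemble)

near_mean = effective_reward(ensemble, beta=1e4)    # large beta: close to the mean
near_min = effective_reward(ensemble, beta=1e-3)    # small beta: close to min(ensemble)
moderate = effective_reward(ensemble, beta=10.0)    # intermediate pessimism
second_order = mu - var / (2 * 10.0)                # truncated cumulant expansion
optimistic = effective_reward(ensemble, beta=10.0, pessimistic=False)
```

In this sketch `moderate` and `second_order` agree to within the size of the neglected third-order term, and by Jensen's inequality the pessimistic value never exceeds the ensemble mean while the optimistic value never falls below it.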