A Unifying Lens on Reward Uncertainty in RLHF

2026-06-08 • Machine Learning

Machine Learning, Artificial Intelligence, Computation and Language

AI summary

The authors address reward hacking in reinforcement learning from human feedback, where a policy exploits errors in the learned reward model to earn high scores without genuinely improving. They propose a distributional reward model, which represents uncertainty that a standard scalar reward model cannot. Viewing the KL-regularized objective through either Bayesian inference or KL-distributionally robust optimization, they derive a closed-form effective reward. Mean aggregation, worst-case optimization, and uncertainty-weighted optimization all arise as limits or truncations of this single formula, which clarifies the implicit assumptions behind each method.

Reinforcement Learning, Human Feedback, Reward Hacking, Reward Model, Uncertainty, Distributional Model, Bayesian Inference, KL Divergence, Robust Optimization, Ensemble Methods

Authors

Ely Hahami, Yoel Zimmermann, Ray Zhou, Jack Benarroch Jedlicki

Abstract

Reinforcement learning from human feedback (RLHF) is bottlenecked by \emph{reward hacking}, where the policy exploits errors in a proxy reward model (RM) and produces high RM scores without genuine quality gains. A natural mitigation is \emph{pessimism}: penalizing rewards in regions where the RM is uncertain. However, standard scalar RMs provide no principled notion of uncertainty. We argue that the right object is a \emph{distributional} reward model $p(r\mid x,y)$. Under either a Bayesian inference or a KL-distributionally robust optimization (KL-DRO) lens, the KL-regularized RLHF objective admits a closed-form effective reward $\tilde r(x,y) = \pm\beta\log\mathbb{E}_p[e^{\pm r/\beta}]$. The pessimistic branch unifies the prior heuristics for RM ensemble aggregation: mean aggregation, worst-case optimization (WCO), and uncertainty-weighted optimization (UWO) all emerge as limits or truncations of this single expression. This also clarifies the implicit assumptions of each existing rule.
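To make the closed form concrete, here is a minimal sketch that evaluates the effective reward on a finite set of reward samples, such as the outputs of an RM ensemble for one (x, y) pair. It is an illustration under stated assumptions, not the authors' implementation: the function names `effective_reward` and `mean_minus_variance`, the toy `ensemble` values, and the use of an equally weighted empirical average for $\mathbb{E}_p$ are all hypothetical. The variance-penalized form $\mu - \sigma^2/(2\beta)$ is the second-order cumulant truncation of the pessimistic branch; reading it as the UWO rule is our interpretation of the abstract's "truncations".

```python
import math


def effective_reward(samples, beta, pessimistic=True):
    """Return sign * beta * log(mean(exp(sign * r / beta))) over reward samples.

    sign = -1 gives the pessimistic branch, sign = +1 the optimistic branch.
    The expectation under p(r | x, y) is approximated by an equally
    weighted average over `samples` (an assumption of this sketch).
    """
    sign = -1.0 if pessimistic else 1.0
    scaled = [sign * r / beta for r in samples]
    shift = max(scaled)  # log-sum-exp shift for numerical stability
    mean_exp = sum(math.exp(s - shift) for s in scaled) / len(scaled)
    return sign * beta * (shift + math.log(mean_exp))


def mean_minus_variance(samples, beta):
    """Second-order truncation of the pessimistic branch: mu - var / (2 * beta)."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((r - mu) ** 2 for r in samples) / n
    return mu - var / (2.0 * beta)


# Hypothetical scores from a three-member RM ensemble for one response.
ensemble = [1.0, 2.0, 4.0]

for beta in (1e-3, 1.0, 10.0, 1e6):
    print(beta, effective_reward(ensemble, beta), mean_minus_variance(ensemble, beta))
```

As $\beta$ grows, the pessimistic value approaches the ensemble mean; as $\beta \to 0$, it approaches the minimum, which corresponds to worst-case aggregation. At intermediate-to-large $\beta$ it is well approximated by the mean minus a variance penalty. For very small $\beta$ the truncated form is no longer a good approximation, which is why the loop prints both values side by side.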
