AI summary
The authors study how to improve preference-based training of large language models by accounting not only for average success but also for the risk of poor outcomes on specific cases. They introduce risk-sensitive preference games, which keep the model performing well even in difficult or rare situations, unlike previous methods that optimize only overall win rates. Their formulation retains the mathematical properties needed for fast learning and comes with guarantees on performance under limited data. They also design an algorithm that corrects estimation bias during training and is especially effective when data is scarce. Overall, the method makes language models more reliable without sacrificing average performance.
Keywords
Preference-based fine-tuning, Large language models, Human feedback, Nash learning, Zero-sum game, Risk-sensitive optimization, Convex risk measures, Stackelberg equilibrium, Extragradient algorithm, Sample complexity
Authors
Max Horwitz, Jake Gonzales, Eric Mazumdar, Lillian J. Ratliff
Abstract
A growing line of work reframes preference-based fine-tuning of large language models game-theoretically: Nash Learning from Human Feedback (NLHF) recasts the problem as a zero-sum game over policies. However, optimization is carried out over expected pairwise payoffs, which conflates policies with similar win rates but different tail behavior. As such, these methods are agnostic to where in the data distribution they succeed or fail: strong average performance can mask systematic failure across prompts, annotators, or safety-critical strata. We introduce risk-sensitive preference games, in which players optimize convex risk measures of their preference loss, exploiting structure in preference uncertainty. While risk sensitivity generally breaks the zero-sum structure, we show that the translation invariance of many risk measures preserves monotonicity, yielding fast convergence of sample-efficient self-play methods. Furthermore, we establish algorithmic stability and offline sample complexity bounds that scale with the risk level, which requires simultaneously controlling the structural bias from nonlinear risk transformations, the statistical bias in risk estimation, and concentration tailored to the risk-sensitive setting. To address the statistical bias, we introduce a hierarchical game formulation and a two-timescale extragradient algorithm with bias correction that converges to the Stackelberg equilibrium and is especially effective in low-sample regimes. Empirically, risk-adjusted policies are robust across data strata, stable across risk choices, and match or exceed risk-neutral performance, achieving robustness without a performance tax.
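To make the setup concrete, below is a minimal sketch (not the authors' implementation) of a risk-sensitive preference game on a toy problem: each player optimizes the CVaR of its per-stratum preference loss rather than the average, and both players are updated with a plain single-timescale extragradient step. The stratum payoff tensor `payoffs`, the risk level `alpha`, and the step size `eta` are illustrative placeholders; the paper's two-timescale, bias-corrected algorithm and its Stackelberg analysis are not reproduced here.

```python
import math

import jax
import jax.numpy as jnp


def cvar(losses, alpha=0.25):
    """Conditional Value-at-Risk: mean of the worst ceil(alpha * S) losses.

    CVaR is convex and translation-invariant, cvar(l + c) == cvar(l) + c,
    the kind of property the abstract invokes to retain monotonicity.
    """
    k = max(1, math.ceil(alpha * losses.shape[0]))
    worst = jnp.sort(losses)[-k:]  # the k largest (worst) losses
    return worst.mean()


def loss_x(x_logits, y_logits, payoffs):
    # Player x's per-stratum loss: negative expected win rate of x's mixed
    # policy against y's under each stratum's pairwise-preference payoff.
    x, y = jax.nn.softmax(x_logits), jax.nn.softmax(y_logits)
    per_stratum = -jnp.einsum("i,sij,j->s", x, payoffs, y)
    return cvar(per_stratum)  # risk over strata, not the average


def loss_y(x_logits, y_logits, payoffs):
    # Player y's loss mirrors x's; because the risk map is nonlinear, the
    # two losses no longer sum to zero, so the game is not zero-sum.
    x, y = jax.nn.softmax(x_logits), jax.nn.softmax(y_logits)
    per_stratum = jnp.einsum("i,sij,j->s", x, payoffs, y)
    return cvar(per_stratum)


def extragradient_step(x_logits, y_logits, payoffs, eta=0.1):
    # Extragradient: extrapolate to a lookahead point, then update the
    # original iterates using gradients evaluated at the lookahead.
    gx = jax.grad(loss_x, argnums=0)
    gy = jax.grad(loss_y, argnums=1)
    x_mid = x_logits - eta * gx(x_logits, y_logits, payoffs)
    y_mid = y_logits - eta * gy(x_logits, y_logits, payoffs)
    x_new = x_logits - eta * gx(x_mid, y_mid, payoffs)
    y_new = y_logits - eta * gy(x_mid, y_mid, payoffs)
    return x_new, y_new


# Toy instance: 8 data strata, 5 candidate responses per player.
payoffs = jax.random.normal(jax.random.PRNGKey(0), (8, 5, 5))
x, y = jnp.zeros(5), jnp.zeros(5)
for _ in range(200):
    x, y = extragradient_step(x, y, payoffs)
```

CVaR is used here only as one concrete translation-invariant convex risk measure; another, such as the entropic risk $\rho(\ell) = \tfrac{1}{\beta}\log \mathbb{E}\, e^{\beta \ell}$, would slot into `cvar`'s place in the same way.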