A Regret Minimization Framework on Preference Learning in Large Language Models
2026-06-08 • Artificial Intelligence
AI summary
The authors propose a way to train language models from human feedback that minimizes regret rather than maximizing reward. Their approach, called RePO, treats human preferences as shaped by anticipation of future outcomes and by counterfactual comparison with alternative behaviors, not only by immediate utility. In experiments on mathematical reasoning benchmarks and human preference datasets, they report consistent performance gains, which they take as evidence that RePO aligns models more closely with what people actually prefer.
Reinforcement learning, Human feedback, Reward maximization, Regret minimization, Preference modeling, Large language models, Counterfactual reasoning, Mathematical reasoning benchmarks, Behavior-conditioned assessment, Reinforcement learning from human feedback (RLHF)
Authors
Suhwan Kim, Taehyun Cho, Geon-Hyeong Kim, Yu Jin Kim, Youngsoo Jang, Moontae Lee, Jungwoo Lee
Abstract
Reinforcement learning with verifiable rewards (RLVR) has enabled progress on reasoning-intensive tasks by relying on task-specific verifiers that provide automated correctness signals. However, many realistic language tasks are difficult to equip with reliable verifiers, motivating a growing reliance on reinforcement learning from human feedback (RLHF). In this setting, we argue that a closer examination of how human feedback should be interpreted is essential. We introduce Regret-based Preference Optimization $(\textbf{RePO})$, which reframes RLHF through $\textit{regret minimization}$ rather than reward maximization. Human preferences are often shaped by $\textit{prospective}$ anticipation of outcomes and $\textit{counterfactual}$ comparisons to alternative behaviors, rather than by immediate, outcome-independent utility. $\textbf{RePO}$ captures this structure by modeling preferences as behavior-conditioned assessments of relative suboptimality. Experiments on mathematical reasoning benchmarks and human preference datasets demonstrate consistent performance gains, indicating that $\textbf{RePO}$ is an effective and human-aligned approach for training large language models.