A Regret Minimization Framework on Preference Learning in Large Language Models

2026-06-08 • Artificial Intelligence

AI summary

The authors propose a way to train AI models from human feedback that minimizes regret instead of maximizing reward. Their method, called RePO, models human preferences as shaped by anticipated outcomes and by comparisons with alternative choices, rather than by immediate results alone. They evaluated it on mathematical reasoning benchmarks and human preference datasets and report consistent performance gains. The authors conclude that RePO is an effective, human-aligned approach for training large language models.

Reinforcement learning, Human feedback, Reward maximization, Regret minimization, Preference modeling, Large language models, Counterfactual reasoning, Mathematical reasoning benchmarks, Behavior-conditioned assessment, Reinforcement learning from human feedback (RLHF)

Authors

Suhwan Kim, Taehyun Cho, Geon-Hyeong Kim, Yu Jin Kim, Youngsoo Jang, Moontae Lee, Jungwoo Lee

Abstract

Reinforcement learning with verifiable rewards (RLVR) has enabled progress on reasoning-intensive tasks by relying on task-specific verifiers that provide automated correctness signals. However, many realistic language tasks are difficult to equip with reliable verifiers, motivating a growing reliance on reinforcement learning from human feedback (RLHF). In this setting, we argue that a closer examination of how human feedback should be interpreted is essential. We introduce Regret-based Preference Optimization (RePO), which reframes RLHF through regret minimization rather than reward maximization. Human preferences are often shaped by prospective anticipation of outcomes and counterfactual comparisons to alternative behaviors, rather than by immediate, outcome-independent utility. RePO captures this structure by modeling preferences as behavior-conditioned assessments of relative suboptimality. Experiments on mathematical reasoning benchmarks and human preference datasets demonstrate consistent performance gains, indicating that RePO is an effective and human-aligned approach for training large language models.
