BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization

2026-06-03 • Artificial Intelligence

Artificial Intelligence, Computation and Language, Computers and Society, Machine Learning

AI summary

The authors address the problem of reducing social bias in large language models, which is tricky because bias isn't a clear-cut or objective issue. They point out problems with earlier methods like DPO, which lacks exploration, and PPO, which can be unstable. To fix this, they propose a new method called BiasGRPO that stabilizes training by comparing rewards across groups of model outputs. Their experiments show BiasGRPO works better than previous methods, and they also provide a new efficient reward model and an extended dataset to help guide bias reduction in future work.

Large Language Models, Social Bias, Alignment, Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), Reinforcement Learning with Human Feedback (RLHF), Reward Model, Bias Mitigation

Authors

Saket Reddy, Ke Yang, ChengXiang Zhai

Abstract

Mitigating social bias in Large Language Models (LLMs) presents a distinct alignment challenge: unlike verifiable tasks, bias lacks a single ground truth, creating a high-variance, subjective reward landscape. Previous preference-based fine-tuning methods have major trade-offs: Direct Preference Optimization (DPO) is limited by the lack of exploration inherent in offline training, while Proximal Policy Optimization (PPO) can lead to training instability due to potentially unreliable critic estimates. In this paper, we propose BiasGRPO, a framework using Group Relative Policy Optimization (GRPO) to stabilize alignment by normalizing rewards across a group of sampled completions. By substituting the value function with a group-relative baseline, our approach reduces instability while maintaining the exploration benefits of online training. We find that BiasGRPO outperforms DPO and PPO across multiple benchmarks, indicating its effectiveness. To adapt GRPO, we synthetically extend a dataset spanning multiple domains and contexts. We also create and release a custom bias reward model that effectively guides generation while being highly compute-efficient and avoiding knowledge degradation, providing a valuable resource that can be seamlessly integrated into multi-objective RLHF pipelines.
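
The group-relative baseline mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly used GRPO formulation, in which each completion's reward is centered on the mean of its group and scaled by the group's standard deviation; the abstract does not spell out the exact normalization BiasGRPO uses. The function name, the `eps` constant, the choice of population standard deviation, and the example reward values are all illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Assumed formulation (not confirmed by the abstract):
        advantage_i = (reward_i - group_mean) / (group_std + eps)
    The group mean plays the role a learned critic would play in PPO.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four completions of a single prompt.
rewards = [0.2, 0.9, 0.4, 0.5]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")
```

Because each advantage is measured against the other completions in the same group, a reward model whose scores are noisy or shifted from prompt to prompt still produces a usable relative signal, and no separate value network has to be trained. When every completion in a group receives the same reward, all advantages in this sketch are zero, so that group contributes no policy gradient.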
