CRPO: Character-centric Group Relative Policy Optimization for Role-aware Reasoning in Role-playing Agents

2026-05-25 · Computation and Language

AI summary

The authors found that current reinforcement learning methods for improving large language models' reasoning do not keep role-playing characters consistent and true to their style. They propose Character-Centric Group Relative Policy Optimization (CRPO), which addresses this by separating task goals from style, adjusting optimization constraints according to how complex a character is, and using generic replies as a negative comparison point to avoid bland answers. Their experiments indicate that CRPO helps the model stay in character and express emotions better than existing methods.

Reinforcement Learning, Large Language Models, Group Relative Policy Optimization, Role-playing agents, Character fidelity, Style collapse, Policy optimization constraints, Gradient conflicts, Negative baselines, Emotion consistency
Authors
Yihong Tang, Kehai Chen, Liang Yue, Benyou Wang, Min Zhang
Abstract
Recent advancements in Reinforcement Learning (RL), particularly Group Relative Policy Optimization (GRPO), have significantly enhanced the reasoning capabilities of Large Language Models. However, applying these problem-centric optimization methods to role-playing agents often leads to a loss of character fidelity and style collapse, as they prioritize context-specific utility over persona alignment. To address this, we propose Character-Centric Group Relative Policy Optimization (CRPO), a framework designed to realign RL objectives with the role-playing task. CRPO improves character distinctiveness through three mechanisms: decoupling task logic from stylistic rewards to resolve gradient conflicts, dynamically adapting optimization constraints based on character complexity, and utilizing generic responses as negative baselines to prevent the model from reverting to a common distribution. Extensive experiments demonstrate that CRPO outperforms existing methods in consistency, emotion, and other evaluation dimensions.
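The abstract gives no formulas, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. The first function is the usual GRPO-style group-relative advantage (each sampled response's reward is standardized within its group). The other two functions are hypothetical illustrations of two ideas named in the abstract: scoring task and style rewards separately before combining them, and measuring responses against a generic reply's score rather than the group mean. The names `decoupled_advantages`, `advantages_vs_generic`, `style_weight`, and `generic_reward` are assumptions introduced here, as is the choice of the population standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std is an assumption; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def decoupled_advantages(task_rewards, style_rewards, style_weight=1.0):
    """Hypothetical decoupling: normalize task and style rewards separately,
    then combine, so neither reward scale dominates the other."""
    task_adv = group_relative_advantages(task_rewards)
    style_adv = group_relative_advantages(style_rewards)
    return [t + style_weight * s for t, s in zip(task_adv, style_adv)]


def advantages_vs_generic(style_rewards, generic_reward, eps=1e-8):
    """Hypothetical negative baseline: center on the score of a generic reply,
    so responses no more in-character than the generic one get no positive credit."""
    sigma = pstdev(style_rewards)
    return [(r - generic_reward) / (sigma + eps) for r in style_rewards]


if __name__ == "__main__":
    task = [1.0, 0.0, 1.0, 0.0]    # illustrative task-correctness scores
    style = [0.9, 0.8, 0.2, 0.1]   # illustrative persona/style scores
    print(decoupled_advantages(task, style))
    print(advantages_vs_generic(style, generic_reward=0.5))
```

In this sketch a response that scores below the generic reply on style receives a negative style advantage even if it is above the group average, which is one way a generic baseline could discourage drift toward bland outputs. The character-complexity-dependent constraint mentioned in the abstract is not sketched, since the abstract does not say how it is computed.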