AI summary
The authors developed EvoRubrics, a reinforcement learning framework in which a model that solves tasks (the Policy LLM) and a model that writes the scoring criteria for its answers (the Rubric Generator) improve together through adversarial interaction within each training step. Unlike earlier methods that keep rubrics fixed, or that update them only occasionally and with help from external frontier models or ground-truth answers, this approach adapts the criteria as the solving model improves, so the rubrics keep separating better answers from worse ones instead of saturating. This co-evolution also acts as an automatic curriculum for the policy. In the reported experiments, EvoRubrics outperforms both static and dynamic rubric baselines across benchmarks, and the trained Rubric Generator also transfers as a reward model. A fully self-supervised variant with no external supervision still achieves meaningful gains, which suggests that co-evolution between generation and evaluation can by itself supply a useful learning signal.
reinforcement learning, policy model, rubric-based reward, co-evolutionary framework, large language model (LLM), self-supervised learning, reward saturation, automatic curriculum, adversarial interaction, transferable reward model
Authors
Hongxin Ding, Baixiang Huang, Yue Fang, Weibin Liao, Zheng Li, Jinyang Zhang, Zhijing Wu, Junfeng Zhao, Yasha Wang
Abstract
Rubric-based rewards offer interpretable and fine-grained optimization signals for reinforcement learning in open-ended tasks where verifiable answers are unavailable. However, pre-constructed rubrics remain static throughout training, creating a fundamental mismatch with the evolving policy: fixed criteria gradually lose discriminative power as the model improves, leading to reward saturation and potential hacking. Recent dynamic rubric methods partially address this but rely on external frontier models or ground-truth answers, and update rubrics only at coarse granularity. We propose EvoRubrics, a co-evolutionary RL framework where a Policy LLM and a Rubric Generator jointly improve through adversarial interaction within each training step. As the policy improves under the rubric generator's guidance, the rubric generator adapts its criteria to remain discriminative and informative, enabling evaluation to track the policy in real time and naturally inducing an automatic curriculum. Experiments show that EvoRubrics consistently outperforms static and dynamic rubric baselines across benchmarks. The learned Rubric Generator further generalizes as a transferable reward model. Notably, even a fully self-supervised variant without any external supervision achieves meaningful gains, suggesting that co-evolution between generation and evaluation alone can provide sufficiently rich learning signals. Our code is publicly available at https://anonymous.4open.science/r/EvoRubrics-2155/.
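The reward saturation described in the abstract can be illustrated with a small sketch. It is not taken from the EvoRubrics code: the criterion names, the pass/fail data, and the use of variance as a stand-in for "discriminative power" are all assumptions made for illustration. The idea is that a criterion which every sampled response passes (or every response fails) can no longer separate better responses from worse ones.

```python
from statistics import pvariance

def criterion_discrimination(passes):
    """Population variance of 0/1 pass outcomes for one criterion.

    Used here as an assumed proxy for discriminative power:
    0 means every sampled response got the same outcome.
    """
    return pvariance(passes)

def saturated_criteria(pass_matrix, eps=1e-9):
    """Return the names of criteria whose outcomes are identical across responses."""
    return [
        name
        for name, passes in pass_matrix.items()
        if criterion_discrimination(passes) <= eps
    ]

# Hypothetical outcomes: 4 sampled responses scored against 3 fixed criteria.
early_policy = {
    "cites_evidence": [1, 0, 0, 1],
    "states_limitations": [0, 0, 1, 0],
    "answers_question": [1, 1, 0, 1],
}
# The same fixed rubric applied to a stronger policy later in training.
late_policy = {
    "cites_evidence": [1, 1, 1, 1],
    "states_limitations": [0, 1, 1, 0],
    "answers_question": [1, 1, 1, 1],
}

print(saturated_criteria(early_policy))  # []
print(saturated_criteria(late_policy))   # ['cites_evidence', 'answers_question']
```

In this toy setting, two of the three fixed criteria stop carrying any signal once the policy improves, which is the kind of mismatch a rubric generator that adapts alongside the policy is meant to avoid.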