When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards

2026-05-25 • Machine Learning

Machine Learning • Computation and Language

AI summary

The authors study how to train large language models with reinforcement learning from verifiable rewards when the ground-truth labels needed for reward computation are expensive to obtain. They propose RLAVR, which acquires ground-truth labels for a small set of selected samples and uses pseudo-labels for the rest, with the aim of stabilizing training under a limited annotation budget. To decide which samples to label, they introduce a metric called the Corrective Advantage Gap (CAG) and develop CARE, a practical acquisition policy that applies this criterion before any label is queried. They report experiments across multiple domains, model families, and model scales, and they release their code.

Large Language Models • Reinforcement Learning • Reward Function • Ground-truth Labels • Pseudo-labels • Training Stability • Sample Selection • Corrective Advantage Gap • Active Learning • Annotation Budget

Authors

Li Wang, Xiaodong Lu, Xiaohan Wang, Yikun Ban, Jiajun Chai, Wei Lin, Tianhao Peng, Guojun Yin

Abstract

Large Language Models (LLMs) have achieved remarkable advancements in reasoning capabilities empowered by Reinforcement Learning with Verifiable Rewards (RLVR). Nonetheless, RLVR intrinsically relies on ground-truth labels for reward computation, the acquisition of which is often prohibitively expensive in real-world scenarios. While unsupervised RLVR paradigms attempt to circumvent this by training on pseudo-labels, they are notoriously susceptible to training collapse. Moreover, different samples often exhibit varying annotation values. In this paper, we propose Reinforcement Learning with Active Verifiable Rewards (RLAVR), which actively acquires ground-truth labels for a small set of selected samples and integrates them with pseudo-labels, thereby stabilizing training dynamics and improving performance under limited annotation budgets. To identify valuable samples, we propose the Corrective Advantage Gap (CAG) metric and analyze the sample-level supervision value. Building on this, we introduce Correction-Aware Reliability Estimation for RLAVR (CARE), which translates the oracle CAG criterion into a practical pre-query acquisition policy to substantially improve training stability. Extensive experiments across diverse domains, model families, and model scales demonstrate the effectiveness and generality of our approach. Our code is available at https://github.com/Lumina04/CARE.
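The general idea of mixing a few queried ground-truth labels with pseudo-labels can be illustrated with a minimal sketch. This is not the authors' CARE policy or the CAG metric, neither of which is defined in the abstract. It assumes pseudo-labels come from a majority vote over sampled answers, and it uses low vote agreement as a placeholder acquisition score. The names `majority_vote`, `build_reward_labels`, `verifiable_reward`, and the `oracle` callable are hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return (pseudo_label, agreement) for one prompt's sampled answers."""
    label, count = Counter(answers).most_common(1)[0]
    return label, count / len(answers)


def build_reward_labels(sampled_answers, oracle, budget):
    """Mix queried ground-truth labels with majority-vote pseudo-labels.

    sampled_answers: dict mapping prompt id -> list of sampled final answers
    oracle: callable mapping prompt id -> ground-truth answer (hypothetical annotator)
    budget: maximum number of prompts to send to the oracle
    """
    votes = {pid: majority_vote(ans) for pid, ans in sampled_answers.items()}
    # Placeholder acquisition score (an assumption, NOT CARE or CAG):
    # query the prompts whose sampled answers agree the least.
    ranked = sorted(votes, key=lambda pid: votes[pid][1])
    queried = set(ranked[:budget])
    labels = {
        pid: oracle(pid) if pid in queried else votes[pid][0]
        for pid in votes
    }
    return labels, queried


def verifiable_reward(answer, label):
    """Binary reward: 1.0 if the answer matches the label, else 0.0."""
    return 1.0 if answer == label else 0.0
```

In this sketch the reward function is the same for both kinds of label. Only the source of the label changes, so a wrong pseudo-label on an unqueried prompt would still reward the wrong answer. That is the failure mode an acquisition policy is meant to reduce.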
