AI summary
The authors study how to improve reinforcement learning for large language models, where there is a trade-off between learning quickly from past examples (off-policy) and staying true to the current model (on-policy). They find that updating too many times on fixed data can cause the model's behavior to drift and stop improving. To fix this, they propose a two-step method called ERPD: first, learn aggressively from fixed data without strict limits, then carefully distill the useful knowledge back into the main model while avoiding harmful changes. Their approach works well on math reasoning tasks, helping models improve even when traditional training gets stuck, and can learn effectively from both strong and weak guidance.
reinforcement learning, large language models, off-policy learning, on-policy learning, trust-region methods, KL divergence, policy distillation, multi-step optimization, mathematical reasoning
Authors
Changyu Chen, Xiting Wang, Rui Yan
Abstract
Reinforcement learning for large language models faces a fundamental trade-off between sample efficiency and asymptotic performance: strictly on-policy methods discard trajectories after a single update, while off-policy reuse introduces distribution mismatch that existing trust-region techniques mitigate primarily by enforcing conservative optimization, often leaving rich training signals underutilized. To investigate this, we perform extensive off-policy updates on fixed data. Our experiments reveal that aggressive multi-step optimization brings rapid initial gains, but excessive updates cause trajectory probabilities to deviate and entropy to collapse, with performance plateauing early. Tightening KL constraints merely lowers the ceiling without resolving the degradation. This motivates Extreme Region Policy Distillation (ERPD), a two-stage framework that decouples sample efficiency from KL efficiency. The first stage performs weakly constrained off-policy optimization on fixed data to maximally extract training signals. The resulting policy provides token-level supervision. In the second stage, we distill these signals into the base policy under trust-region constraints, filtering harmful drift while preserving useful signals. The distilled policy achieves comparable or better performance with substantially smaller KL divergence, indicating that much of the first-stage divergence was spent on unnecessary drift rather than genuine improvement. Crucially, ERPD accommodates both strong and weak teachers: when aggressive optimization yields no stronger policy, even degenerate teachers provide effective supervision via alternative signal construction strategies. We validate ERPD on mathematical reasoning, showing gains for strong base models where on-policy training plateaus, and reliable improvements with weak teachers.
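The second-stage idea, taking token-level supervision from an aggressively optimized teacher while keeping the student inside a KL trust region around the base policy, can be illustrated on a single next-token distribution. The sketch below is a toy under stated assumptions, not the authors' training procedure: it assumes a reverse-KL distillation objective and a per-token KL budget `delta`, and every name in it (`distill_token`, `geometric_mix`, `delta`) is hypothetical. Under those assumptions, the distribution closest to the teacher that still satisfies the budget lies on the geometric path between base and teacher, so a bisection over the mixing weight `alpha` is enough.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(q, p):
    """KL(q || p) for two strictly positive discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

def geometric_mix(base, teacher, alpha):
    """q proportional to base^(1 - alpha) * teacher^alpha, with alpha in [0, 1]."""
    logits = [(1.0 - alpha) * math.log(b) + alpha * math.log(t)
              for b, t in zip(base, teacher)]
    return softmax(logits)

def distill_token(base, teacher, delta, iters=60):
    """Move from base toward teacher while keeping KL(student || base) <= delta.

    Assumption (not from the paper): the student minimizes KL(student || teacher)
    subject to the budget, whose solution lies on the geometric path. KL to the
    base grows monotonically along that path, so bisection on alpha is valid.
    """
    if kl(teacher, base) <= delta:
        return list(teacher), 1.0      # teacher is already inside the trust region
    lo, hi = 0.0, 1.0                  # lo always satisfies the budget
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(geometric_mix(base, teacher, mid), base) <= delta:
            lo = mid
        else:
            hi = mid
    return geometric_mix(base, teacher, lo), lo

# Hypothetical four-token vocabulary: the teacher has drifted far from the base.
base = softmax([1.0, 0.5, 0.0, -0.5])
teacher = softmax([4.0, 0.0, -2.0, -3.0])
delta = 0.05                           # illustrative per-token KL budget

student, alpha = distill_token(base, teacher, delta)
print("alpha:", alpha)
print("KL(teacher || base):", kl(teacher, base))
print("KL(student || base):", kl(student, base))
```

In this toy the student shifts probability toward the token the teacher prefers while its divergence from the base stays within the budget, which mirrors the abstract's point that a useful signal can be kept at a much smaller KL cost. A real implementation would apply a constraint of this kind through gradient updates on model parameters over sampled trajectories rather than in closed form per token.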