Experience Augmented Policy Optimization for LLM Reasoning

2026-06-29

Machine Learning
AI summary

The authors study Reinforcement Learning with Verifiable Rewards (RLVR), a paradigm for improving the reasoning of large language models (LLMs). They note that existing RLVR methods optimize on-policy from scratch, which makes sampling expensive and wastes accumulated experience, and that replaying fixed reasoning trajectories suffers from policy mismatch as the model changes during training. Their method, Experience-Augmented Policy Optimization (EAPO), instead uses a previously RL-optimized policy as an action-level prior, injects its experience selectively at critical decision points during rollout, and applies an adapted importance sampling scheme to keep learning stable and unbiased. Experiments with two Qwen models on five benchmarks show that EAPO consistently improves reasoning performance over state-of-the-art RLVR methods.

Reinforcement Learning, Large Language Models, Policy Optimization, Experience Replay, Importance Sampling, Reasoning Tasks, On-Policy Learning, Qwen Models, Rollout, Verifiable Rewards
Authors
Jinda Lu, Kexin Huang, Junkang Wu, Shuo Yang, Jinghan Li, Chiyu Ma, Shaohang Wei, Xiang Wang, Guoyin Wang, Jingren Zhou
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for improving the reasoning capabilities of large language models (LLMs). However, existing RLVR methods typically rely on on-policy optimization from scratch, resulting in high sampling costs and inefficient utilization of accumulated experience. Recent attempts to reuse experience via fixed reasoning trajectories further suffer from policy mismatch, because model capabilities and policy behaviors evolve during training. Motivated by these limitations, we argue that experience in RLVR should not be reused as fixed reasoning trajectories, but should instead be expressed in a policy-adaptive manner. In this work, we propose Experience-Augmented Policy Optimization (EAPO), which leverages a prior RL-optimized policy as an action-level experience prior and selectively injects experience at critical decision points during rollout. To ensure stable and unbiased learning from experience-augmented rollouts, EAPO further incorporates an adapted importance sampling scheme. Experiments using Qwen-2.5-Math-7B and Qwen-3-8B on five benchmarks demonstrate that EAPO consistently improves reasoning performance over state-of-the-art RLVR methods.
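
The abstract does not say how critical decision points are detected or how the importance sampling scheme is adapted, so the following is only a toy sketch of the general idea, not the authors' algorithm. It assumes, purely for illustration, that a step counts as "critical" when the current policy's token distribution has high entropy, that the prior policy's distribution is then used to sample the action, and that the mismatch is corrected with a standard per-step importance ratio. The names `augmented_step`, `importance_weight`, and `entropy_threshold` are hypothetical.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def importance_weight(policy_probs, behavior_probs, action):
    """Standard importance ratio pi(a) / mu(a) for the sampled action."""
    return policy_probs[action] / behavior_probs[action]


def augmented_step(policy_probs, prior_probs, entropy_threshold, rng):
    """Sample one action for a single rollout step.

    ASSUMPTION (not from the paper): the step is treated as a critical
    decision point when the current policy is uncertain, i.e. its entropy
    exceeds `entropy_threshold`. At such steps the action is drawn from the
    prior policy; otherwise it is drawn from the current policy.

    Returns (action, weight, used_prior), where `weight` corrects for having
    sampled from the behavior distribution instead of the current policy.
    """
    used_prior = entropy(policy_probs) > entropy_threshold
    behavior_probs = prior_probs if used_prior else policy_probs
    action = rng.choices(range(len(behavior_probs)), weights=behavior_probs, k=1)[0]
    weight = importance_weight(policy_probs, behavior_probs, action)
    return action, weight, used_prior


if __name__ == "__main__":
    rng = random.Random(0)
    prior = [0.7, 0.1, 0.1, 0.1]          # hypothetical prior-policy distribution
    confident = [0.97, 0.01, 0.01, 0.01]  # low entropy: the policy acts on its own
    uncertain = [0.25, 0.25, 0.25, 0.25]  # high entropy: the prior is consulted

    for name, probs in [("confident", confident), ("uncertain", uncertain)]:
        action, weight, used_prior = augmented_step(probs, prior, 1.0, rng)
        print(name, "action:", action, "weight:", round(weight, 3), "prior used:", used_prior)
```

When the action comes from the current policy the weight is exactly 1, so those steps reduce to ordinary on-policy sampling. When it comes from the prior, the ratio reweights the sample so that its expectation under the behavior distribution matches the current policy, provided the prior assigns nonzero probability to every action the policy can take. In practice such ratios are often clipped to control variance; whether and how EAPO does so is not stated in the abstract.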