Bounded Ratio Reinforcement Learning
2026-04-20 • Machine Learning · Artificial Intelligence
AI summary
The authors address a gap between popular reinforcement learning methods by creating a new framework called Bounded Ratio Reinforcement Learning that offers a clear mathematical solution to improve policies step-by-step. They introduce an algorithm, Bounded Policy Optimization, which closely follows this solution to enhance learning performance and stability. Their work also explains why PPO, a widely used method, works well and connects it to other optimization techniques. They extend their approach for fine-tuning large language models and show through experiments that their methods perform as well or better than existing ones across different tasks.
Reinforcement Learning · Proximal Policy Optimization (PPO) · Trust Region Methods · Policy Optimization · Bounded Ratio Reinforcement Learning (BRRL) · Advantage-weighted Divergence · Cross-Entropy Method (CEM) · Large Language Models (LLMs) · MuJoCo · IsaacLab
Authors
Yunke Ao, Le Chen, Bruce D. Lee, Assefa S. Wahd, Aline Czarnobai, Philipp Fürnstahl, Bernhard Schölkopf, Andreas Krause
Abstract
Proximal Policy Optimization (PPO) has become the predominant algorithm for on-policy reinforcement learning due to its scalability and empirical robustness across domains. However, there is a significant disconnect between the underlying foundations of trust region methods and the heuristic clipped objective used in PPO. In this paper, we bridge this gap by introducing the Bounded Ratio Reinforcement Learning (BRRL) framework. We formulate a novel regularized and constrained policy optimization problem and derive its analytical optimal solution. We prove that this solution ensures monotonic performance improvement. To handle parameterized policy classes, we develop a policy optimization algorithm called Bounded Policy Optimization (BPO) that minimizes an advantage-weighted divergence between the policy and the analytic optimal solution from BRRL. We further establish a lower bound on the expected performance of the resulting policy in terms of the BPO loss function. Notably, our framework also provides a new theoretical lens to interpret the success of the PPO loss, and connects trust region policy optimization and the Cross-Entropy Method (CEM). We additionally extend BPO to Group-relative BPO (GBPO) for LLM fine-tuning. Empirical evaluations of BPO across MuJoCo, Atari, and complex IsaacLab environments (e.g., Humanoid locomotion), and of GBPO for LLM fine-tuning tasks, demonstrate that BPO and GBPO generally match or outperform PPO and GRPO in stability and final performance.
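For context on the heuristic objective the abstract contrasts with BRRL's analytic solution: the standard PPO clipped surrogate bounds the probability ratio between the new and old policy. The sketch below is a minimal NumPy illustration of that well-known baseline loss only; it does not reproduce the paper's BPO or GBPO objectives, and the sample values are made up for illustration.

```python
import numpy as np

def ppo_clipped_loss(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate, returned as a loss (negated
    objective to be minimized). `ratio` is pi_new(a|s) / pi_old(a|s)
    per sample; `advantage` is the estimated advantage per sample."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the elementwise minimum keeps the update pessimistic
    # whenever the ratio leaves the [1 - eps, 1 + eps] band.
    return -np.minimum(unclipped, clipped).mean()

# Illustrative (fabricated) per-sample ratios and advantages.
ratios = np.array([0.8, 1.0, 1.5])
advs = np.array([1.0, -0.5, 2.0])
loss = ppo_clipped_loss(ratios, advs)
print(loss)  # -0.9: the 1.5 ratio is clipped to 1.2 before weighting
```

The clipping is exactly the heuristic step the paper revisits: BRRL instead derives the ratio bound from a regularized, constrained optimization problem with a closed-form solution.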