AdvantageFlow: Advantage-Weighted Least Squares for RL in Flow Models

2026-05-25, Machine Learning

Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition
AI summary

The authors present AdvantageFlow, a reinforcement learning method for fine-tuning rectified flow models, which are generative models that produce data by gradually transforming noise. Instead of optimizing the reverse (sampling) process, as Flow-GRPO does, it optimizes the forward process with a prediction loss weighted by advantage values. Because this loss becomes non-convex and unstable when advantages are negative, the authors add rollout policy regularization, which reduces variance. In image generation experiments with Stable Diffusion 3.5 Medium, AdvantageFlow outperforms both Flow-GRPO and a forward-process baseline based on negative-aware fine-tuning.

Advantage-weighted loss, Rectified flow models, Forward-process optimization, Reinforcement learning, Rollout policy regularization, Non-convex optimization, Stable Diffusion, Image generation, Flow-GRPO
Authors
Branislav Kveton, Anup Rao, Subhojyoti Mukherjee, Krishna Kumar Singh, Viet Dac Lai
Abstract
We introduce AdvantageFlow, a forward-process reinforcement learning algorithm for rectified flow models. Unlike Flow-GRPO, which optimizes the reverse process, we optimize an advantage-weighted forward-process prediction loss. This optimization problem is unstable when advantages are negative and the loss becomes non-convex. We stabilize it by rollout policy regularization, which reduces variance and arises from fitting a local reward-improving target distribution. We evaluate AdvantageFlow on image generation tasks with Stable Diffusion 3.5 Medium. It outperforms both Flow-GRPO and a state-of-the-art forward-process RL baseline based on negative-aware fine-tuning.
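
To make the abstract's instability claim concrete, here is a minimal toy sketch of an advantage-weighted least-squares loss on a rectified-flow forward process. It is not the authors' implementation. The following are assumptions made for illustration only: the interpolation convention x_t = (1 - t) x_0 + t eps with target velocity eps - x_0, the group-normalized advantages, the function names, and in particular the quadratic penalty toward the rollout policy's prediction (weight `beta`), which stands in for rollout policy regularization, whose exact form the abstract does not give.

```python
import math
import random


def group_advantages(rewards):
    """Group-normalized advantages (r - mean) / std (assumed, GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # small constant avoids division by zero
    return [(r - mean) / std for r in rewards]


def forward_sample(x0, t, rng):
    """Assumed forward process: x_t = (1 - t) * x0 + t * eps.

    Returns the noised sample x_t and the regression target eps - x0.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [(1 - t) * a + t * e for a, e in zip(x0, eps)]
    target = [e - a for a, e in zip(x0, eps)]
    return xt, target


def sq_err(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def advantage_weighted_loss(preds, targets, advantages,
                            rollout_preds=None, beta=0.0):
    """Mean of A_i * ||pred_i - target_i||^2, plus an optional penalty.

    The penalty beta * ||pred_i - rollout_pred_i||^2 is a hypothetical
    stand-in for rollout policy regularization, not the paper's definition.
    """
    total = 0.0
    for i, (p, y, a) in enumerate(zip(preds, targets, advantages)):
        total += a * sq_err(p, y)
        if rollout_preds is not None:
            total += beta * sq_err(p, rollout_preds[i])
    return total / len(preds)


# One sample with a negative advantage: pushing the prediction away from
# the target lowers the unregularized loss without bound.
unregularized = advantage_weighted_loss([[10.0]], [[0.0]], [-1.0])  # -100.0

# With beta > |A|, the combined objective is convex in the prediction again.
regularized = advantage_weighted_loss(
    [[10.0]], [[0.0]], [-1.0], rollout_preds=[[0.0]], beta=2.0
)  # 100.0
```

In this toy setting, a negative advantage flips the sign of the squared error, so the unregularized loss rewards predictions that move arbitrarily far from the target. The assumed penalty restores a bounded objective only when `beta` exceeds the magnitude of the negative advantage. How the paper chooses or derives this trade-off is not stated in the abstract.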