Reinforcement Learning from Denoising Feedback
2026-05-25 • Computation and Language
Computation and Language, Machine Learning
AI summary
The authors address a core problem in applying reinforcement learning to diffusion language models (dLLMs): estimating the policy loss accurately and efficiently. They propose Reinforcement Learning from Denoising Feedback (RLDF), a training paradigm that uses feedback from the rollout and training processes to estimate that loss. To balance computational cost against estimation quality, RLDF optimizes the model toward the clipped clean state from intermediate noisy states and uses weighted timestep sampling. Experiments on multiple reasoning benchmarks show improved performance and generalizability on two dLLM architectures, LLaDA and Dream. The authors also build Drift, a training framework for dLLMs, and make it publicly available.
Reinforcement Learning, Diffusion Language Models, Policy Loss Estimation, Denoising, Rollout, Timestep Sampling, LLaDA, Dream, Generalizability, Training Framework
Authors
Qi He, Huan Chen, Ya Guo, Huijia Zhu, Yi R. Fung, Baojian Zhou
Abstract
Policy loss estimation remains a fundamental and long-standing challenge in reinforcement learning (RL) for diffusion language models (dLLMs). We introduce Reinforcement Learning from Denoising Feedback (RLDF), a novel training paradigm that leverages feedback obtained from rollout and training processes to facilitate accurate and efficient policy loss estimation. To balance the trade-off between computational efficiency and estimation effectiveness, RLDF optimizes the model toward the clipped clean state $\hat{x}_0$ from intermediate noisy states $x_t$, combined with weighted timestep sampling over $t$. Extensive experiments demonstrate that RLDF achieves consistent and substantial improvements in both performance and generalizability across two representative dLLM architectures, LLaDA and Dream, on multiple reasoning benchmarks. Our work lays a principled foundation for scalable reinforcement learning in diffusion language models. We build Drift, a training framework for dLLMs, available at https://github.com/ant-research/Drift.