Relative Score Policy Optimization for Diffusion Language Models
2026-05-11 • Computation and Language
AI summary
The authors looked at a type of language model called diffusion large language models (dLLMs) that can generate text efficiently but need improvement in reasoning. They noticed that traditional reinforcement learning methods struggle with these models because they require hard-to-calculate values called sequence-level log-ratios. To fix this, the authors created a new method, RSPO, that adjusts the model based on comparing noisy estimates with targets derived from rewards, making training more stable. Their tests showed RSPO works well, especially on planning problems, and also performs competitively on math reasoning tasks.
diffusion large language models, reinforcement learning, policy optimization, sequence-level log-ratios, ELBO, reward advantage, relative log-ratio, mathematical reasoning, planning tasks
Authors
Zichao Yu, Shengze Xu, Bingqing Jiang, Wenyi Zhang, Difan Zou
Abstract
Diffusion large language models (dLLMs) offer a promising route to parallel and efficient text generation, but improving their reasoning ability requires effective post-training. Reinforcement learning with verifiable rewards (RLVR) is a natural choice for this purpose, yet its application to dLLMs is hindered by the absence of tractable sequence-level log-ratios, which are central to standard policy optimization. This absence forces existing methods to rely on high-variance ELBO-based approximations, where high verifier rewards can amplify inaccurate score estimates and destabilize RL training. To overcome this issue, we propose Relative Score Policy Optimization (RSPO), a simple RLVR method that uses verifiable rewards to calibrate noisy likelihood estimates in dLLMs. The core of our algorithm rests on a key observation: a reward advantage can be interpreted not only as an update direction, but also as a target for the relative log-ratio between the current and reference policies. Accordingly, RSPO calibrates the noisy relative log-ratio estimate against this reward-implied target, updating the policy according to the gap between the estimate and the target rather than by the raw advantage alone. Experiments on mathematical reasoning and planning benchmarks show that RSPO yields especially strong gains on planning tasks and competitive performance on mathematical reasoning.
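
To make the calibration idea concrete, the following is a minimal PyTorch-style sketch of one plausible instantiation, not the paper's exact objective. The function name rspo_loss, the GRPO-style group normalization of rewards, the linear advantage-to-target mapping with scale beta, and the squared-gap loss are all illustrative assumptions; the paper only specifies that the policy is updated by the gap between the noisy relative log-ratio estimate and a reward-implied target.

    import torch

    def rspo_loss(elbo_logp_cur, elbo_logp_ref, rewards, beta=0.1):
        """Sketch of an RSPO-style update (assumptions noted in the lead-in).

        elbo_logp_cur : (B,) ELBO estimates of log p_theta(y | x) for sampled completions
        elbo_logp_ref : (B,) ELBO estimates of log p_ref(y | x) from the frozen reference
        rewards       : (B,) verifiable rewards for the same completions
        beta          : assumed scale mapping an advantage to a target log-ratio
        """
        # Group-relative advantage (GRPO-style normalization; an assumption here).
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

        # Noisy relative log-ratio between current and reference policies,
        # built from the ELBO-based likelihood estimates.
        log_ratio = elbo_logp_cur - elbo_logp_ref.detach()

        # Reward-implied target for that relative log-ratio.
        target = beta * adv.detach()

        # Update by the gap between estimate and target, not the raw advantage:
        # a squared-gap loss gives a gradient proportional to (log_ratio - target).
        return 0.5 * ((log_ratio - target) ** 2).mean()

Under this squared-gap form, a completion whose estimated log-ratio already matches its reward-implied target contributes no gradient, so an inaccurate but high-reward score estimate is pulled toward the target rather than amplified, which is the stabilization the abstract describes.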