Reinforcing Few-step Generators via Reward-Tilted Distribution Matching

2026-05-25 • Computer Vision and Pattern Recognition

AI summary

The authors developed a method called RTDMD that helps few-step image generators produce better pictures in only 4 inference steps. It combines two ideas: distribution matching, which teaches the fast generator to imitate a slower teacher model, and reward-guided reinforcement learning, which steers the generator toward images that score well on human-preference rewards. The method runs in two stages: the first distills the teacher into the few-step generator via distribution matching, and the second jointly optimizes distribution matching and reward maximization. Experiments on SD3, SD3.5, and FLUX.2 show that it outperforms previous few-step text-to-image methods on preference, aesthetic, and compositional metrics.

diffusion models, distillation, KL divergence, reinforcement learning, policy gradient, reward maximization, distribution matching, few-step generation, text-to-image synthesis, model compression

Authors

Yushi Huang, Xiangxin Zhou, Ruoyu Wang, Chi Zhang, Jun Zhang, Tianyu Pang

Abstract

Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher distribution naturally decomposes into a distribution matching term and a reward maximization term. In the first stage, we introduce Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs subinterval-wise distribution matching and augments the fake score objective with a consistency regularizer to help the fake score model track the shifting generator distribution under limited updates. In the second stage, we jointly optimize both terms: for the reward maximization term, we derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate transitions with direct reward backpropagation through the deterministic final step, and further introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 demonstrate that RTDMD establishes new state-of-the-art results across preference, aesthetic, and compositional metrics with only 4 inference steps, outperforming previous few-step text-to-image generation methods. Code and models are available at https://github.com/Harahan/RTDMD.
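
The decomposition mentioned in the abstract can be illustrated with a standard identity for exponentially tilted distributions. The sketch below is not taken from the paper: the symbols p_teacher, p_theta, r, the temperature beta, and the normalizer Z are assumed notation, and the paper's exact formulation (for example, conditioning on the prompt or working at intermediate noise levels) may differ.

```latex
% Assumed notation: p_teacher is the teacher distribution, p_theta the
% few-step generator, r a scalar reward, beta > 0 a temperature.
\begin{align}
p^{*}(x) &= \frac{1}{Z}\, p_{\text{teacher}}(x)\,
            \exp\!\left(\frac{r(x)}{\beta}\right),
\qquad
Z = \mathbb{E}_{x \sim p_{\text{teacher}}}\!\left[\exp\!\left(\frac{r(x)}{\beta}\right)\right], \\
D_{\mathrm{KL}}\!\left(p_{\theta} \,\|\, p^{*}\right)
  &= \mathbb{E}_{x \sim p_{\theta}}\!\left[
       \log p_{\theta}(x) - \log p_{\text{teacher}}(x)
       - \frac{r(x)}{\beta} + \log Z \right] \\
  &= \underbrace{D_{\mathrm{KL}}\!\left(p_{\theta} \,\|\, p_{\text{teacher}}\right)}_{\text{distribution matching}}
     \;-\; \underbrace{\frac{1}{\beta}\,\mathbb{E}_{x \sim p_{\theta}}\!\left[r(x)\right]}_{\text{reward maximization}}
     \;+\; \log Z .
\end{align}
```

Because log Z does not depend on the generator parameters, minimizing the left-hand side over theta is equivalent to minimizing the KL divergence to the teacher while maximizing the expected reward, with beta controlling the trade-off between the two terms.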
