DRM: Diffusion-based Reward Model With Step-wise Guidance

2026-05-25 • Computer Vision and Pattern Recognition

AI summary

The authors point out that common methods for aligning image-generating models with human preferences rely on reward models built for language and semantic understanding, and that these miss perceptual qualities such as aesthetics and composition. They introduce the Diffusion-based Reward Model (DRM), which uses an image-generating diffusion model itself to judge images at every step of generation, not just at the end. This step-by-step feedback lets their new training algorithm, Step-wise GRPO, assign rewards to individual steps rather than only to the final image. They also propose an inference-time strategy called Step-wise Sampling, which evaluates multiple generation paths at each step and steers generation toward the better ones. Their experiments indicate that this approach improves the quality of generated images.

diffusion models, reward model, visual aesthetics, reinforcement learning, GRPO algorithm, image generation, step-wise evaluation, inference strategy, perceptual quality

Authors

Jaxon Zhang, Binxin Yang, Hubery Yin, Chen Li, Jing Lyu

Abstract

Current mainstream methods for aligning diffusion models with human preferences typically employ VLM-based reward models. However, these reward models, pre-trained for semantic alignment, struggle to capture essential perceptual qualities such as aesthetics, composition, and visual harmony. In this work, we argue that a model capable of high-fidelity generation must possess a profound understanding of these visual attributes. Based on this insight, we introduce the Diffusion-based Reward Model (DRM), a novel paradigm that uses a pre-trained diffusion model as a powerful evaluative backbone. A key advantage of the DRM is its unique ability to assess not only the final image but also the noisy intermediate latents at any stage of the generative process. We leverage this step-wise evaluative capacity in two ways. First, we propose Step-wise GRPO, a reinforcement learning algorithm that provides dense, per-step rewards to resolve the imprecise credit assignment problem in the GRPO algorithm, leading to more stable and effective alignment. Second, we introduce Step-wise Sampling, a novel inference strategy that employs the DRM as a dynamic guide to evaluate multiple generation paths at each step, steering the process towards higher-quality outcomes. Extensive experiments confirm that our approach significantly enhances the final quality of generated images. Code: https://github.com/jjaxonx/DRM.
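
The two uses of step-wise evaluation described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (the linked repository is the reference for that), and it rests on assumptions the abstract does not state: that per-step rewards are normalized within a group of trajectories at each step, in the way standard GRPO normalizes final rewards by group mean and standard deviation, and that Step-wise Sampling greedily keeps the highest-scoring candidate at each step. The names `stepwise_group_advantages`, `stepwise_sampling`, `propose`, and `score` are hypothetical, and `score` merely stands in for a reward model such as the DRM.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence, TypeVar

State = TypeVar("State")


def stepwise_group_advantages(
    rewards: Sequence[Sequence[float]], eps: float = 1e-8
) -> List[List[float]]:
    """Group-relative advantages computed separately at every step.

    rewards[i][t] is the reward of trajectory i at denoising step t.
    ASSUMPTION: each step is normalized by the group mean and (population)
    standard deviation at that step; the paper may define this differently.
    """
    num_steps = len(rewards[0])
    advantages = [[0.0] * num_steps for _ in rewards]
    for t in range(num_steps):
        column = [trajectory[t] for trajectory in rewards]
        mu, sigma = mean(column), pstdev(column)
        for i, value in enumerate(column):
            advantages[i][t] = (value - mu) / (sigma + eps)
    return advantages


def stepwise_sampling(
    initial_state: State,
    propose: Callable[[State, int], State],
    score: Callable[[State, int], float],
    num_steps: int,
    num_candidates: int,
) -> State:
    """Greedy step-wise selection among candidate next states.

    propose(state, t) returns one candidate next state (hypothetical stand-in
    for a stochastic denoising step); score(candidate, t) rates an
    intermediate state (hypothetical stand-in for a step-wise reward model).
    ASSUMPTION: only the single best candidate survives each step.
    """
    state = initial_state
    for t in range(num_steps):
        candidates = [propose(state, t) for _ in range(num_candidates)]
        state = max(candidates, key=lambda candidate: score(candidate, t))
    return state
```

In this sketch, a trajectory whose reward at a given step is above the group average receives a positive advantage for that step only, which is one way dense rewards could sharpen credit assignment. Setting `num_candidates=1` in `stepwise_sampling` reduces it to ordinary unguided sampling.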
