Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis

2026-06-29 · Computer Vision and Pattern Recognition

AI summary

The authors introduce Nemotron-Labs-Diffusion-Image, a new model for creating high-quality images from text using a masked discrete diffusion approach. They solve two main problems: the model can now edit previously generated parts of the image during creation, making it more accurate, and they improve training by giving learning signals to similar tokens instead of just exact matches. To make training faster and more memory-efficient, they design a special operator for their method. Their experiments show that these improvements lead to better image quality and training performance compared to previous models.

masked discrete diffusion model, text-to-image synthesis, token-editing mechanism, vocabulary size, optimization, Grouped Cross-Entropy, embedding space, training efficiency, GenEval, VRAM
Authors
Shufan Li, Greg Heinrich, Hanrong Ye, Yonggan Fu, Aditya Grover, Jan Kautz, Pavlo Molchanov
Abstract
We propose Nemotron-Labs-Diffusion-Image, a state-of-the-art masked discrete diffusion model (MDM) for high-resolution text-to-image synthesis. Compared with prior work on masked image generation, Nemotron-Labs-Diffusion-Image addresses two key challenges. First, unlike continuous diffusion models, which progressively refine latent representations across the entire image, standard MDMs lack self-correcting capability because discrete tokens cannot be modified once they are unmasked. Second, although increasing the vocabulary size of discrete image tokenizers improves reconstruction fidelity, it introduces optimization difficulties for generative modeling as the per-token training signal becomes increasingly sparse. To address the first challenge, Nemotron-Labs-Diffusion-Image incorporates a token-editing mechanism that enables the model to dynamically revise already-unmasked tokens during inference, similar to how a sculptor iteratively refines their work. To tackle the second challenge, we propose a Grouped Cross-Entropy (GCE) objective that assigns positive learning signals to tokens neighboring the ground truth in embedding space, thereby alleviating signal sparsity. To further improve training efficiency, we implement a custom fused operator for GCE that significantly reduces VRAM usage in large-vocabulary settings. Experimental results demonstrate that these innovations substantially improve both training efficiency and image fidelity of masked discrete image generators, achieving a score of 0.90 on GenEval, 86.9 on DPG, and 10.76 on HPSv3.
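
The abstract does not give the exact form of the GCE objective, so the following is only an illustrative sketch of one plausible reading, not the authors' formulation. It assumes the "group" for a ground-truth token is its k nearest codebook entries under squared Euclidean distance, and that the loss is the negative log of the total softmax probability assigned to that group. The function names `nearest_group` and `grouped_cross_entropy`, the parameter `k`, and the distance metric are all hypothetical. The fused, memory-efficient operator mentioned in the abstract is not represented here.

```python
import math


def nearest_group(codebook, target, k):
    """Return indices of the k codebook entries closest to codebook[target].

    Assumption: neighbors are chosen by squared Euclidean distance.
    The target index itself is always the first element of the group.
    """
    t = codebook[target]

    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(codebook[i], t))

    # Sort with the target forced first, then by distance to the target.
    order = sorted(range(len(codebook)), key=lambda i: (i != target, sq_dist(i)))
    return order[:k]


def grouped_cross_entropy(logits, group):
    """Negative log of the softmax probability mass assigned to `group`.

    With group == [target] this reduces to standard cross-entropy.
    Both log-sum-exp terms subtract their max for numerical stability.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    gm = max(logits[i] for i in group)
    log_group = gm + math.log(sum(math.exp(logits[i] - gm) for i in group))
    return log_z - log_group


# Toy example: 4-entry codebook with 2-D embeddings (made-up values).
codebook = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.2]]
logits = [2.0, 1.0, 0.0, -1.0]
group = nearest_group(codebook, target=0, k=2)
loss = grouped_cross_entropy(logits, group)
```

Under this reading, probability placed on a near neighbor of the ground-truth token lowers the loss instead of being penalized like any other wrong token, which is one way the sparse per-token signal in a large vocabulary could be densified.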