DSL-LLaDA: Scaling Continuous Denoising to 8B Masked Diffusion LMs

2026-05-31 • Computation and Language

Computation and Language, Artificial Intelligence

AI summary

The authors adapted a language model called LLaDA-8B-Instruct so that it refines every position of a text together in a continuous way, rather than fixing some words and filling in the rest bit by bit. They did this by adding varying amounts of Gaussian noise to each word's representation instead of hiding words completely. This helps the model avoid stopping too early or repeating itself when writing summaries in only a few steps (16 or fewer forward passes). It also helps the model fix corrupted words in noisy inputs without disturbing the correct ones. The adaptation took only 1,000 additional training steps, and control runs that used standard masked training with the same compute showed neither benefit.

Discrete Masked Diffusion Language Models, Continuous Denoising, Embedding Space, Gaussian Noise, Iterative Parallel Decoding, ROUGE Score, Zero-shot Summarization, Token Masking, Pretraining, Stochastic Localization

Authors

Longxuan Yu, Yunshu Wu, Yu Fu, Siheng Xiong, Rob Brekelmans, Hui Liu, Yue Dong, Greg Ver Steeg

Abstract

Discrete masked diffusion language models (DLMs) generate text by iterative parallel decoding, but few-step decoding suffers from a tradeoff between length and quality: with a fixed step budget, standard methods either generate a short, high-quality output or produce long but repetitive text. Continuous denoising can sidestep this tradeoff by evolving all positions jointly in embedding space, but building such a model from scratch at scale remains an open problem. We show that a pretrained masked DLM can instead be lightly adapted to support continuous embedding-space denoising. Starting from LLaDA-8B-Instruct, we continue pretraining for only 1,000 steps with Discrete Stochastic Localization (DSL), replacing binary masking with continuous per-token Gaussian noise that acts as a soft mask. The adapted model supports continuous inference that evolves all positions jointly in embedding space and defers hard token commitment to the final step. On zero-shot summarization at low step budgets (≤ 16 forward passes), DSL-LLaDA-SDE achieves the best ROUGE-1 on all four benchmarks and largely avoids the premature-termination/repetition tradeoff of iterative unmasking. The same adaptation also yields selective noisy-state robustness: the model corrects corrupted tokens while preserving clean ones. Control experiments using standard masked diffusion training with the same compute exhibit neither behavior.
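
The contrast between binary masking and a per-token Gaussian soft mask can be illustrated with a toy sketch. This is not the authors' implementation: the mixing rule sqrt(alpha) * embedding + sqrt(1 - alpha) * noise, the one-hot toy embedding table, the dot-product commitment rule, and the function names binary_mask, soft_mask, and commit are all assumptions made for illustration, and no denoising model is included.

```python
import math
import random


def binary_mask(tokens, mask_prob, mask_token="[MASK]", rng=random):
    """Standard masked-diffusion corruption: each token is either kept or hidden."""
    return [mask_token if rng.random() < mask_prob else tok for tok in tokens]


def soft_mask(embeddings, alphas, rng=random):
    """Assumed soft-mask corruption: per-token Gaussian noise in embedding space.

    alphas[i] = 1.0 leaves position i clean; alphas[i] = 0.0 replaces it with
    pure noise. The square-root mixing rule is an illustrative assumption.
    """
    noisy = []
    for vec, alpha in zip(embeddings, alphas):
        signal, noise = math.sqrt(alpha), math.sqrt(1.0 - alpha)
        noisy.append([signal * x + noise * rng.gauss(0.0, 1.0) for x in vec])
    return noisy


def commit(states, table):
    """Hard token commitment: pick the vocabulary entry with the largest dot product."""
    tokens = []
    for state in states:
        best = max(table, key=lambda tok: sum(s * e for s, e in zip(state, table[tok])))
        tokens.append(best)
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical 3-word vocabulary with one-hot embeddings.
    table = {"a": [1.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0], "c": [0.0, 0.0, 1.0]}
    tokens = ["a", "b", "c"]
    embeddings = [table[t] for t in tokens]

    print(binary_mask(tokens, mask_prob=0.5, rng=rng))
    # Each position gets its own noise level: clean, partly noised, heavily noised.
    noisy = soft_mask(embeddings, alphas=[1.0, 0.7, 0.1], rng=rng)
    print(commit(noisy, table))
```

In this sketch every position carries a continuous state, so a lightly noised token still retains most of its identity, whereas a binary mask discards it entirely. Commitment to discrete tokens happens only when commit is called, mirroring the idea of deferring hard decisions to the final step.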
