LESS Is More: Mutual-Stability Sampling for Diffusion Language Models

2026-06-15 • Computation and Language

AI summary

The authors propose LESS, a training-free adaptive sampler for generating text more efficiently with diffusion large language models. Existing samplers run a fixed number of reverse denoising steps chosen before decoding, which spends computation on positions that are already stable and sometimes commits unstable ones too early. LESS instead finalizes a token only when its top prediction is confident, has persisted across recent steps, and its predictive distribution has stopped changing. Evaluated on Dream-7B, LLaDA-8B, and LLaDA-1.5-8B across seven benchmarks, it improves average accuracy over other training-free adaptive samplers while using 72.1% fewer reverse steps than fixed-budget decoding.

diffusion large language models, autoregressive decoding, sampling, reverse denoising steps, token commitment, online stopping problem, Jensen–Shannon divergence, Transformer, inference latency, adaptive sampling

Authors

Amr Mohamed, Guokan Shang, Michalis Vazirgiannis

Abstract

Diffusion large language models (dLLMs) offer a promising alternative to autoregressive decoding by iteratively refining masked sequences, enabling parallel token updates and bidirectional conditioning. Their practical efficiency, however, is limited by sampling procedures that execute a fixed number of reverse denoising steps selected before decoding, spending computation on already-stable positions and sometimes committing unstable ones too early. We present LESS, a training-free, model-agnostic adaptive sampler that treats token commitment as an online stopping problem. LESS implements mutual-stability sampling through a joint stability rule that makes a masked position eligible for unmasking only when its top-1 prediction has high confidence, its top-1 token persists across recent reverse steps, and its predictive distribution is stable under top-K inter-step Jensen–Shannon divergence. We evaluate LESS on Dream-7B, LLaDA-8B, and LLaDA-1.5-8B, covering full-sequence diffusion and semi-autoregressive blockwise sampling regimes, across seven benchmarks spanning general knowledge, math, and code. LESS improves average accuracy over strong training-free adaptive samplers while using 72.1% fewer reverse steps than fixed-budget decoding. Since each reverse step requires a Transformer forward pass, these step-count reductions translate into fewer forward evaluations, lower measured wall-clock latency, and lower estimated inference compute.
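
The joint stability rule described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the thresholds `tau_conf` and `tau_jsd`, the window length, and the choice to compute the top-K divergence on the union of the two steps' top-K tokens are all assumptions made for illustration.

```python
import numpy as np

def topk_jsd(p_prev, p_curr, k=5, eps=1e-12):
    """Jensen-Shannon divergence between two steps' distributions,
    restricted to the union of their top-k tokens and renormalized.
    (The restriction scheme is an assumption, not taken from the paper.)"""
    support = np.union1d(np.argsort(p_prev)[-k:], np.argsort(p_curr)[-k:])
    p = p_prev[support] / p_prev[support].sum()
    q = p_curr[support] / p_curr[support].sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # eps guards log(0); terms with a == 0 contribute exactly 0.
        return np.sum(a * (np.log(a + eps) - np.log(b + eps)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def eligible_positions(prob_history, masked, tau_conf=0.9, window=3,
                       k=5, tau_jsd=0.01):
    """Return a boolean array marking masked positions that pass all
    three checks: confidence, top-1 persistence, distribution stability.

    prob_history: list of [seq_len, vocab] arrays, oldest step first.
    masked:       boolean [seq_len] array, True where still masked.
    All thresholds are hypothetical values chosen for the example.
    """
    assert window >= 2, "need at least two steps to compare distributions"
    if len(prob_history) < window:
        return np.zeros_like(masked, dtype=bool)

    recent = prob_history[-window:]
    curr, prev = recent[-1], recent[-2]
    top1 = curr.argmax(axis=-1)

    # 1. The top-1 prediction has high confidence.
    confident = curr.max(axis=-1) >= tau_conf
    # 2. The top-1 token is the same across the recent window.
    persistent = np.all([p.argmax(axis=-1) == top1 for p in recent], axis=0)
    # 3. The distribution barely moved between the last two steps.
    stable = np.array([topk_jsd(prev[i], curr[i], k) <= tau_jsd
                       for i in range(curr.shape[0])])

    return masked & confident & persistent & stable
```

A decoding loop would call `eligible_positions` after each forward pass, unmask the returned positions, and stop once nothing remains masked, so the number of reverse steps adapts to the input rather than being fixed in advance.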
