SimSD: Simple Speculative Decoding in Diffusion Language Models

2026-06-01 • Computation and Language

Computation and Language, Artificial Intelligence

AI summary

The authors study diffusion large language models (dLLMs), which can generate text faster than traditional autoregressive (AR) models but are incompatible with standard token-level speculative decoding, a popular acceleration method. They propose SimSD, which lets a dLLM verify multiple drafted tokens in a single forward pass by introducing reference tokens from the draft model and using an attention mask to control how those tokens interact with current-step tokens. This keeps the parallel decoding benefits of dLLMs while restoring the fast verification that causal masking gives AR models. The method requires no retraining and raises decoding throughput by up to 7.46x without hurting average generation quality.

diffusion large language models, autoregressive models, speculative decoding, masked language modeling, causal masking, attention mask, parallel decoding, verification tokens, KV cache, blockwise decoding

Authors

Junxia Cui, Haotian Ye, Runchu Tian, Hongcan Guo, Jinya Jiang, Haoru Li, Chaojie Ren, Yiming Huang, Kaijie Zhu, Zhongkai Yu, Kun Zhou, Jingbo Shang

Abstract

Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) LLMs, offering faster inference through parallel or blockwise decoding. However, their masked language modeling formulation remains incompatible with standard token-level speculative decoding, one of the most effective acceleration techniques for AR models. In AR decoding, the causal mask preserves temporally valid token-level contexts, enabling a target model to verify multiple drafted tokens in a single forward pass. In contrast, dLLMs rely on mask tokens and bidirectional attention, causing the effective context to change across denoising steps and preventing direct token-level speculative verification. To bridge this gap, we propose SimSD, a simple but effective speculative decoding algorithm for diffusion language models. SimSD adopts a plug-and-play masking strategy that equips dLLMs with temporally valid token-level contexts for speculative decoding. Our method explicitly introduces reference tokens from draft-model predictions and designs an attention mask that regulates their interaction with current-step tokens, allowing dLLMs to compute valid logits for drafted tokens in a single forward pass. This restores the key verification ability provided by causal masking in AR models while preserving the parallel decoding advantages of dLLMs. The proposed method is training-free and can be flexibly integrated with other acceleration techniques such as KV cache and blockwise decoding. Experiments on SDAR-family dLLMs across four benchmarks show that our method achieves up to 7.46x higher decoding throughput while maintaining, and even improving, average generation quality.
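
The AR verification step that the abstract uses as its point of comparison can be illustrated with a small sketch. This is standard greedy speculative verification under a causal mask, not SimSD's reference-token masking strategy, whose details the abstract does not give. The names `causal_mask`, `argmax`, and `greedy_verify`, and the toy logits in the usage note, are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def argmax(row: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(row)), key=lambda k: row[k])


def greedy_verify(
    draft: Sequence[int],
    target_logits: Sequence[Sequence[float]],
) -> List[int]:
    """Greedy speculative verification (illustrative assumption, not SimSD).

    target_logits[i] holds the target model's logits for draft position i,
    conditioned on the prompt plus draft[:i]. Under a causal mask, all rows
    can come from one forward pass because each row's context is fixed.
    The sequence has len(draft) + 1 rows; the last row scores the token
    that follows a fully accepted draft.
    """
    accepted: List[int] = []
    for i, token in enumerate(draft):
        predicted = argmax(target_logits[i])
        if predicted != token:
            # First mismatch: keep the target's token and stop.
            accepted.append(predicted)
            return accepted
        accepted.append(token)
    # Every drafted token matched, so the target contributes one extra token.
    accepted.append(argmax(target_logits[len(draft)]))
    return accepted
```

For example, with a three-token vocabulary, `greedy_verify([2, 0, 1], [[0.1, 0.2, 0.9], [0.8, 0.1, 0.1], [0.0, 0.3, 0.7], [0.5, 0.2, 0.3]])` accepts the first two drafted tokens and substitutes the target's choice at the third position. With bidirectional attention, the rows of `target_logits` would no longer correspond to fixed prefixes, which is the mismatch the abstract describes for dLLMs.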
