DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding

2026-06-01 · Computation and Language

AI summary

The authors propose DFlare, a method that speeds up large language model inference by drafting a whole block of tokens at once for the target model to verify in parallel. Unlike DFlash, which makes every draft layer share a single fused representation built from only a few target layers, DFlare gives each draft layer its own learnable mix of information from a broad set of target model layers. This lets the draft model be scaled to deeper architectures with consistent gains, and the authors also enlarge the training set from 800K to 2.4M samples to use the added capacity. Across six benchmarks covering mathematical reasoning, code generation, and conversation, DFlare reaches average wall-clock speedups of about 3.9x to 5.5x and improves over DFlash by roughly 5% to 11%, depending on the target model. The code is released in the AngelSlim repository.

block diffusion speculative decoding, large language models, draft model, target model, layer-wise fusion, model scaling, inference speedup, Qwen3, GPT-OSS, parallel verification
Authors
Jiebin Zhang, Zhenghan Yu, Song Liu, Eugene J. Yu, Zheng Li, Dawei Zhu, Jiangshan Duo, Weimin Xiong, Yifan Song, Guanghua Yu, Jianchen Zhu, Sujian Li
Abstract
Block diffusion speculative decoding accelerates LLM inference by predicting all tokens within a block simultaneously for the target model to verify in parallel. Predicting an entire block at once requires a sufficiently capable draft model and effective utilization of the target model's internal knowledge. However, the state-of-the-art method DFlash constrains all draft layers to share a single fused representation derived from only a few target layers, limiting per-layer expressiveness and hindering further scaling of draft capacity. In this paper, we present DFlare, which flares out the narrow conditioning bottleneck of DFlash through a lightweight layer-wise fusion mechanism: each draft layer attends to its own learnable combination of a broad set of target layers at negligible overhead, simultaneously injecting richer target knowledge and providing every draft layer with a distinct input. This enhanced per-layer expressiveness enables scaling the draft model to deeper architectures with consistent gains. We further scale training data from 800K to 2.4M samples to fully exploit the enlarged capacity. On six benchmarks spanning mathematical reasoning, code generation, and conversation, DFlare attains average wall-clock speedups of 5.52x on Qwen3-4B, 5.46x on Qwen3-8B, and 3.91x on GPT-OSS-20B, improving over DFlash by roughly 11%, 8%, and 5% respectively. Our code is available at https://github.com/Tencent/AngelSlim.
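
The layer-wise fusion idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only says that each draft layer uses "its own learnable combination" of target layers, so the softmax-weighted sum, the function names (`softmax`, `fuse_per_layer`), and the toy sizes below are all assumptions chosen for illustration. The sketch contrasts per-layer mixing weights with a shared-bottleneck setup in which every draft layer receives the same fused vector.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_per_layer(target_states, mix_logits):
    """Return one fused conditioning vector per draft layer.

    target_states: K hidden-state vectors, one per selected target layer.
    mix_logits:    L rows of K learnable logits, one row per draft layer.
    """
    hidden_dim = len(target_states[0])
    fused = []
    for row in mix_logits:
        # Assumption: each draft layer takes a convex (softmax) combination
        # of the target-layer states; the paper may parameterize this differently.
        weights = softmax(row)
        fused.append([
            sum(w * state[j] for w, state in zip(weights, target_states))
            for j in range(hidden_dim)
        ])
    return fused


# Hypothetical toy sizes: 6 target layers, 3 draft layers, hidden size 4.
rng = random.Random(0)
target_states = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
mix_logits = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(3)]

# Layer-wise fusion: every draft layer gets a distinct input.
fused = fuse_per_layer(target_states, mix_logits)

# Shared-bottleneck contrast: all draft layers reuse one mixing row,
# so they all receive the same fused vector.
shared = fuse_per_layer(target_states, [mix_logits[0]] * 3)
```

In this toy setup the extra parameters are one row of K logits per draft layer, which is in line with the abstract's claim that the mechanism is lightweight; the real mechanism operates on full hidden-state tensors and its exact form is given in the paper, not here.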