SemBlock: Semantic Boundary Dynamic Blocks for Diffusion LLMs
2026-06-03 • Computation and Language
AI summary
The authors propose SemBlock, a method that improves blockwise text generation in diffusion language models by ending each block at a semantic boundary instead of relying on a fixed block size or delimiter-based signals. Lightweight predictors trained on frozen LLaDA hidden states learn to recognize these boundaries, supervised by SemBound, a dataset that labels boundaries between discourse units, reasoning steps, and implementation spans in natural language, math, and code. At inference time, the predicted boundary probabilities determine where each block ends. On GSM8K, IFEval, MATH, and HumanEval, the authors report consistent improvements over fixed-block decoding and AdaBlock.
Diffusion Language Models, Blockwise Decoding, Semantic Boundaries, Dynamic Block Construction, LLaDA, Discourse Units, Reasoning Steps, Natural Language Processing, Code Generation, GSM8K dataset
Authors
Xinrui Song, Zhuoran Wang, Mingju Gao, Hao Tang
Abstract
Diffusion language models (DLMs) generate text through iterative denoising, and blockwise decoding improves their practicality by committing tokens in local blocks. However, existing blockwise methods typically rely on fixed block sizes or delimiter-based runtime signals, which do not necessarily align with semantic boundaries. In this paper, we propose SemBlock, a semantic-boundary-driven dynamic block decoding framework for diffusion LLMs. SemBlock formulates dynamic block construction as semantic boundary prediction and trains lightweight predictors on frozen LLaDA hidden states. To provide supervision, we construct SemBound, a semantic-boundary dataset that derives boundary labels from discourse units, reasoning steps, and implementation spans across natural language, math, and code tasks. During inference, SemBlock uses predicted boundary probabilities to select the ending position of each dynamic block. Experiments on GSM8K, IFEval, MATH, and HumanEval show that SemBlock consistently improves over fixed-block decoding and AdaBlock. Our code is publicly available: https://github.com/TH-AI-Lab-PKU/SemBlock.
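To make the inference step concrete, the following is a minimal sketch of how predicted boundary probabilities could be turned into block ending positions. It is an illustration under stated assumptions, not the authors' implementation: the function names `select_block_end` and `split_into_blocks`, the `min_len` and `max_len` window, the `threshold`, and the fallback to a maximum-length block are all hypothetical choices that the abstract does not specify. It also assumes the probabilities are available up front as a list, whereas in actual decoding they would come from a predictor applied to the model's hidden states as generation proceeds.

```python
from typing import List, Sequence, Tuple


def select_block_end(
    boundary_probs: Sequence[float],
    start: int,
    min_len: int = 4,       # hypothetical lower bound on block length
    max_len: int = 32,      # hypothetical upper bound on block length
    threshold: float = 0.5  # hypothetical confidence cutoff
) -> int:
    """Return the exclusive end index of the block that begins at `start`.

    Assumption: boundary_probs[i] is the predicted probability that token i
    is the last token of a semantic unit.
    """
    n = len(boundary_probs)
    if not 0 <= start < n:
        raise ValueError("start must index into boundary_probs")
    if not 1 <= min_len <= max_len:
        raise ValueError("require 1 <= min_len <= max_len")

    # Candidate exclusive ends, clipped to the sequence length.
    lo = min(start + min_len, n)
    hi = min(start + max_len, n)

    # Pick the candidate whose last token has the highest boundary
    # probability; max() keeps the earliest candidate on ties.
    best_end = max(range(lo, hi + 1), key=lambda end: boundary_probs[end - 1])

    if boundary_probs[best_end - 1] >= threshold:
        return best_end
    # No confident boundary in the window: fall back to the longest block.
    return hi


def split_into_blocks(
    boundary_probs: Sequence[float], **kwargs
) -> List[Tuple[int, int]]:
    """Greedily partition the sequence into (start, end) blocks."""
    blocks = []
    start = 0
    while start < len(boundary_probs):
        end = select_block_end(boundary_probs, start, **kwargs)
        blocks.append((start, end))
        start = end
    return blocks


probs = [0.1, 0.2, 0.9, 0.1, 0.1, 0.8, 0.1, 0.1]
blocks = split_into_blocks(probs, min_len=2, max_len=4, threshold=0.5)
print(blocks)  # [(0, 3), (3, 6), (6, 8)]
```

In this toy input the first two blocks end right after the high-probability positions 2 and 5, and the final block falls back to the end of the sequence because no remaining position clears the threshold. With uniformly low probabilities the sketch degenerates to fixed-size blocks of `max_len` tokens, which matches the fixed-block baseline the abstract compares against.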