Triplet-Block Diffusion RWKV
2026-05-25 • Computation and Language
AI summary
The authors address the problem that causal Transformer language models decode slowly: they generate one token at a time and pay an attention cost at every step. They combine two existing approaches, linear-time causal models and discrete diffusion models, which normally do not fit together because diffusion needs to attend to context on both sides of a position, while causal models process the sequence in one direction only. Their method, B³D-RWKV, uses a triplet-block layout to support parallel, bidirectional diffusion decoding while keeping RWKV's linear-time inference. On an 8-task suite, the 7.2B model reaches accuracy comparable to existing models while decoding about 1.6 times faster on average.
causal Transformer, decoding, attention cost, discrete diffusion models, bidirectional attention, unidirectional attention, RWKV, parallel processing, inference efficiency, triplet-block layout
Authors
Ke Lin, Yiyang Luo, Zhaolong Su, Yunya Song, Anyi Rao
Abstract
Causal Transformer language models suffer from strictly sequential decoding and a quadratic per-step attention cost. While linear-time causal models and discrete diffusion models each address these weaknesses, their integration remains inherently inconsistent: diffusion requires bidirectional attention, while causal models are unidirectional. To unify these architectures, we propose B³D-RWKV, a diffusion RWKV variant that integrates RWKV's $O(L)$ inference efficiency with parallel, bidirectional discrete diffusion through a triplet-block layout method. B³D-RWKV-7.2B reaches accuracy comparable to existing models on an 8-task suite while significantly outperforming baselines in decoding throughput, with an average speedup of 1.6×.
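
The mismatch between unidirectional and bidirectional context can be illustrated with a toy example. The sketch below is not RWKV and not the triplet-block layout, neither of which is specified in this abstract; it assumes a hypothetical scalar linear recurrence (`causal_scan`) as a stand-in for a linear-time causal model, and a hypothetical `bidirectional_scan` that adds a right-to-left pass. It shows that editing a later token leaves earlier causal outputs unchanged, whereas a bidirectional model's earlier outputs do change.

```python
def causal_scan(xs, decay=0.5):
    """Toy linear recurrence s_t = decay * s_{t-1} + x_t (assumed stand-in, not RWKV).

    Runs in O(L) time with a single scalar state; output t depends only on xs[0..t].
    """
    state = 0.0
    outputs = []
    for x in xs:
        state = decay * state + x
        outputs.append(state)
    return outputs


def bidirectional_scan(xs, decay=0.5):
    """Hypothetical bidirectional variant: left-to-right pass plus right-to-left pass."""
    forward = causal_scan(xs, decay)
    backward = causal_scan(xs[::-1], decay)[::-1]
    # Subtract x so each position's own input is not counted twice.
    return [f + b - x for f, b, x in zip(forward, backward, xs)]


tokens = [1.0, 2.0, 3.0, 4.0]
edited = [1.0, 2.0, 3.0, 9.0]  # only the last position differs

causal_a, causal_b = causal_scan(tokens), causal_scan(edited)
bidir_a, bidir_b = bidirectional_scan(tokens), bidirectional_scan(edited)

# Causal outputs before the edit are identical; bidirectional ones are not.
print(causal_a[:3] == causal_b[:3])
print(bidir_a[:3] == bidir_b[:3])
```

A diffusion denoiser that fills masked positions generally wants the second behavior, which a purely causal recurrence does not provide; this is the inconsistency the abstract describes.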