DiSC: Resolution-Scalable Acceleration of Diffusion Models by Exploiting Sparsity and Cached Token Reuse with Hash-based Distribution
2026-05-25 • Hardware Architecture
AI summary
The authors address the problem that transformer-based diffusion models are slow and costly at high resolutions, because denoising is iterative and self-attention scales quadratically. They introduce DiSC, a hardware accelerator paired with two software techniques: Cached Token Reuse (CTR), which turns step-to-step differences in the input latent into per-token reuse decisions, and Softmax Thresholding with Sparsity Mask Reuse (ST), which reuses a sparsity pattern across steps to avoid prediction overhead. The architecture runs the resulting sparse work on its existing compute engines, so the added hardware overhead is minimal. On DiT and PixArt-Sigma, DiSC is reported to run 3.47-4.74x faster than an NVIDIA A100 and 2.48-3.50x faster than an H100, with energy savings of 46.4% to 68.1%.
Transformer, Diffusion Models, Self-Attention, Sparsity, Hardware Accelerator, Token Reuse, Softmax Thresholding, On-Chip Memory, Dataflow, Energy Efficiency
Authors
Jieon Yoon, Hangyeol Lee, Jaehoon Heo, Joo-Young Kim
Abstract
Transformer-based diffusion models offer superior scalability and performance but suffer from high computational overhead due to the iterative nature and quadratic complexity of self-attention at high resolutions. In this paper, we propose DiSC, a resolution-scalable, sparsity-aware hardware accelerator. At the software level, DiSC introduces two algorithms: Cached Token Reuse (CTR) and Softmax Thresholding with Sparsity Mask Reuse (ST). CTR introduces a mechanism that translates spatial variations in the input latent difference across steps into a token-level reuse decision, effectively eliminating redundant token computation. ST induces sparsity in attention operations by reusing a generated sparsity pattern, leveraging temporal similarity to bypass costly prediction overhead. Together, these algorithms provide resolution-scalable computational benefits and yield moderate sparsity and a hybrid dense-sparse workload. To exploit this efficiently, we design a specialized hardware architecture and unified dataflow. This architecture avoids dedicated sparsity-handling components; instead, a hash-based distribution over on-chip memory banks allows DiSC to reuse its existing compute engines for sparse operations, efficiently exploiting the induced sparsity with minimal hardware overhead. Evaluated on DiT and PixArt-Sigma, DiSC achieves 3.47-4.74x and 2.48-3.50x speedups over NVIDIA A100 and H100 GPUs, respectively, with energy savings ranging from 46.4% to 68.1%.
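The abstract does not specify how CTR and ST are computed, so the following is only a minimal illustrative sketch of the two ideas, not the authors' method. It assumes a per-token mean absolute latent difference as the reuse criterion, a fixed probability threshold for the attention mask, and renormalization over the kept entries. All function names and threshold values are hypothetical.

```python
import math

def token_reuse_mask(prev_latent, curr_latent, threshold):
    """CTR-style decision (assumed form): True means 'reuse the cached token'.

    A token is reused when its mean absolute change since the previous
    step is below `threshold` (a hypothetical tuning knob).
    """
    mask = []
    for prev_tok, curr_tok in zip(prev_latent, curr_latent):
        diff = sum(abs(c - p) for p, c in zip(prev_tok, curr_tok)) / len(curr_tok)
        mask.append(diff < threshold)
    return mask

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def sparsity_mask(scores, threshold):
    """ST-style mask (assumed form): keep entries with probability >= threshold.

    The largest entry of each row is always kept so no row becomes empty.
    """
    mask = []
    for row in scores:
        probs = softmax(row)
        cutoff = min(threshold, max(probs))
        mask.append([p >= cutoff for p in probs])
    return mask

def masked_attention_probs(scores, mask):
    """Softmax over kept entries only; pruned entries are exactly 0.0.

    Passing a mask generated at an earlier step models 'mask reuse':
    no new thresholding pass is needed for the current scores.
    """
    out = []
    for row, keep in zip(scores, mask):
        m = max(x for x, k in zip(row, keep) if k)
        exps = [math.exp(x - m) if k else 0.0 for x, k in zip(row, keep)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Toy data: three tokens with two features each, at steps t and t+1.
prev_latent = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
curr_latent = [[0.0, 0.01], [1.5, 1.0], [2.0, 2.0]]
reuse = token_reuse_mask(prev_latent, curr_latent, threshold=0.1)

# Attention scores at step t generate the mask; step t+1 reuses it.
scores_t = [[4.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
scores_t1 = [[3.9, 0.1, 0.0], [0.1, 0.0, 0.0]]
mask_t = sparsity_mask(scores_t, threshold=0.05)
probs_t1 = masked_attention_probs(scores_t1, mask_t)
```

In this toy run only the second token changes enough to be recomputed, and the first attention row keeps a single entry while the second stays dense, which gives a small instance of the hybrid dense-sparse workload the abstract describes.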