Training-free sparse attention based on cumulative energy filtering
2026-06-15 • Computer Vision and Pattern Recognition
AI summary
The authors study how to speed up video generation with Diffusion Transformers by computing attention only for the important tokens and skipping the rest. They argue that holding a fixed recall rate is enough to preserve accuracy, while a fixed cutoff threshold leaves computational savings unused. The paper introduces a dynamic thresholding scheme that lets more tokens be skipped at the same level of accuracy. The method is integrated with Flash Attention, so it needs no additional masking computation. In tests on Wan 2.2, the approach reaches higher sparsity and computational efficiency than BLASST, with a VBench drop of less than 5%.
Sparse Attention, Diffusion Transformers, Token Selection, Dynamic Thresholding, Recall Rate, Flash Attention, Computational Efficiency, Video Generation, Top-k, Top-p
Authors
Chunlu Li, Yixuan Pan, Bai Du, Zhenyuan Chen, Yanzhao Li, Hui Dong, Hui Wang, Zhiqiang Zou
Abstract
Sparse attention accelerates Diffusion Transformers (DiTs) for video generation by computing only the important tokens while skipping the rest. The token selection strategy is key to balancing sparsity and accuracy. We formulate the token filtering process as a dual-goal optimization problem: maximizing sparsity and minimizing accuracy degradation. Existing algorithms cannot fulfill both objectives simultaneously. For example, Top-p only considers the accuracy constraint, while Top-k maintains a fixed computational budget but loosens the accuracy constraint. This paper demonstrates that maintaining a fixed recall rate is sufficient for ensuring accuracy, whereas a fixed threshold is suboptimal for reducing computational cost. Therefore, we propose a dynamic thresholding scheme to improve sparsity while maintaining the same level of accuracy. Furthermore, our algorithm is deeply integrated with Flash Attention (FA), eliminating the need for any additional masking computation overhead. Experimental results on Wan 2.2 validate that, compared to the BLASST algorithm, which is also integrated with FA, our dynamic thresholding strategy enhances sparsity from 61.42\% to 82\% with a VBench metric drop of less than 5\%. This results in an approximate 15\% in attention computation and a $1.61\times$ increase in computational efficiency, which is $1.18\times$ higher than that of BLASST.
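To make the trade-off between the selection rules named in the abstract concrete, the following is a minimal sketch in plain Python. It is not the authors' dynamic thresholding scheme and it is not integrated with Flash Attention. It only compares generic Top-k, Top-p, and fixed-threshold selection on a single row of attention probabilities. The function names, the toy score vectors, and the parameter values (`k=2`, `p=0.9`, `tau=0.05`) are all assumptions chosen for illustration, and "recall" is taken here to mean the fraction of attention mass retained by the kept tokens.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_top_k(probs, k):
    """Keep the k highest-probability tokens (fixed compute budget)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(order[:k])


def select_top_p(probs, p):
    """Keep the smallest set of tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(kept)


def select_threshold(probs, tau):
    """Keep every token whose probability is at least the fixed value tau."""
    return [i for i, q in enumerate(probs) if q >= tau]


def recall(probs, kept):
    """Fraction of total attention mass covered by the kept tokens."""
    return sum(probs[i] for i in kept)


def sparsity(probs, kept):
    """Fraction of tokens that are skipped."""
    return 1.0 - len(kept) / len(probs)


# Two hypothetical attention rows: one sharply peaked, one nearly flat.
peaked = softmax([8.0, 1.0, 0.5, 0.0, -1.0, -1.0, -2.0, -2.0])
flat = softmax([1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3])

for name, probs in [("peaked", peaked), ("flat", flat)]:
    rules = [
        ("top-k (k=2)", select_top_k(probs, 2)),
        ("top-p (p=0.9)", select_top_p(probs, 0.9)),
        ("threshold (tau=0.05)", select_threshold(probs, 0.05)),
    ]
    for label, kept in rules:
        print(f"{name:6s} {label:22s} kept={len(kept)} "
              f"sparsity={sparsity(probs, kept):.3f} "
              f"recall={recall(probs, kept):.3f}")
```

On the flat row, Top-k with `k=2` keeps the budget fixed but retains less than half of the attention mass, while Top-p with `p=0.9` meets the recall target only by keeping seven of the eight tokens. On the peaked row, Top-p needs a single token, whereas Top-k still spends its full budget of two. The fixed threshold keeps one token on the peaked row and all eight on the flat row, so one cutoff value does not suit both distributions.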