ScalingAttention: Discovering Intrinsic Sparse Attention Topology for Video Diffusion Transformers

2026-06-22, Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors observe that Diffusion Transformers (DiTs), used for high-quality video generation, are slowed by 3D full attention, whose cost grows quadratically with the number of tokens. They introduce ScalingAttention, a training-free method built on the finding that each attention head's high-mass regions settle into a stable, prompt-independent sparse pattern. Their approach separates discovering this pattern (done offline) from deciding how sparse each head should be (tuned to the fidelity the diffusion process requires), so that attention can be skipped where it matters least. They also co-design a hardware-aligned block-sparse kernel to turn the sparsity into practical acceleration, and report up to 1.90× end-to-end speedup on Wan2.1 without loss of fidelity.

Diffusion Transformers, 3D Full Attention, Sparse Attention, Intrinsic Sparse Topology, Block-Sparse Mask, Weight-Encoding, Diffusion Fidelity, Hardware Acceleration, Video Generation
Authors
Ruiliang Zhou, Xuecheng Wu, Kang He, Guangyun Han, Bin Liu, Qinqin Chen, Wende Xu, Qingjie Zhao, Chengru Song
Abstract
While Diffusion Transformers (DiTs) have revolutionized high-fidelity video generation, their reliance on 3D full attention creates a quadratic computational bottleneck. Existing sparse methods face a dilemma: dynamic pruning suffers from prohibitive runtime overhead and memory fragmentation, while static heuristics fail to capture fine-grained dependencies. In this work, we propose ScalingAttention, a training-free framework grounded in a key inductive bias: while individual activations are input-dependent, the high-mass attention regions for each head rapidly converge to a stable, prompt-agnostic Intrinsic Sparse Topology. This topology is weight-encoded, scale-invariant, and efficient to extract. ScalingAttention decouples topology discovery from sparsity control via two components: (1) WEST (Weight-Encoded Sparse Topology), which extracts a robust block-sparse prior mask offline to eliminate runtime search; and (2) FAST (Fidelity-Aware Sensitivity Tuning), which adaptively tunes head-wise sparsity based on diffusion fidelity requirements. To ensure practical acceleration, we co-design a hardware-aligned bit-wise block-sparse kernel. Experiments on Wan2.1 show up to 1.90× end-to-end speedup with superior fidelity, establishing a new Pareto frontier over state-of-the-art baselines.
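To make the idea of a bit-wise block-sparse mask concrete, the following is a minimal sketch in plain Python. It is not the authors' WEST procedure or their kernel: the abstract does not specify either, so the selection rule (keep the highest-mass key blocks until a target fraction of each row's attention mass is covered) and the names `block_mass`, `topology_bitmask`, and `density` are assumptions chosen purely for illustration. The sketch also assumes the sequence length is divisible by the block size.

```python
def block_mass(attn, block):
    """Sum attention probability inside each (block x block) tile.

    attn is a square list-of-lists attention map for one head.
    Assumption: len(attn) is divisible by block.
    """
    n = len(attn)
    nb = n // block
    mass = [[0.0] * nb for _ in range(nb)]
    for i in range(n):
        for j in range(n):
            mass[i // block][j // block] += attn[i][j]
    return mass


def topology_bitmask(mass, keep_fraction):
    """Return one integer bitmask per query-block row.

    Bit j is set when key block j is kept. Blocks are added in order of
    decreasing mass until keep_fraction of the row's mass is covered.
    (Hypothetical rule for illustration, not the paper's criterion.)
    """
    masks = []
    for row in mass:
        total = sum(row)
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        bits, covered = 0, 0.0
        for j in order:
            if covered >= keep_fraction * total:
                break
            bits |= 1 << j
            covered += row[j]
        masks.append(bits)
    return masks


def density(masks, num_blocks):
    """Fraction of tiles that a block-sparse kernel would still compute."""
    kept = sum(bin(m).count("1") for m in masks)
    return kept / (num_blocks * len(masks))


# Toy 4-token attention map whose mass concentrates on the diagonal tiles.
toy_attn = [
    [0.70, 0.20, 0.05, 0.05],
    [0.30, 0.60, 0.05, 0.05],
    [0.05, 0.05, 0.50, 0.40],
    [0.10, 0.00, 0.40, 0.50],
]
toy_mass = block_mass(toy_attn, block=2)
toy_masks = topology_bitmask(toy_mass, keep_fraction=0.8)
toy_density = density(toy_masks, num_blocks=2)
```

In this toy case each query block keeps only its diagonal key block, so half of the tiles are skipped. Storing one bit per tile keeps the mask compact and cheap to test inside a kernel loop, which is one plausible reading of "bit-wise" in the abstract; a per-head `keep_fraction` plays the role of the sparsity knob that the abstract attributes to FAST.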