SpotAttention: Plug-In Block-Sparse Routing for Pretrained Long-Context Transformers
2026-06-22 • Machine Learning
Machine Learning, Artificial Intelligence
AI summary
The authors introduce SpotAttention, a method for speeding up large language models on very long inputs. Instead of attending to every past token, it selects only the most relevant ones, which cuts the cost of both prefill and decoding. A lightweight selector learns to predict where the model's attention would go while the original model stays frozen. In the authors' tests it matches dense accuracy at contexts up to 128K tokens, eight times the training length, and decodes 3.9x faster than FlashAttention and 1.8x faster than Twilight.
long context, pretrained language models, sparse attention, KL distillation, key-value cache, top-K selection, quantization, FlashAttention, transformer, decode speed
Authors
Huzama Ahmad, Se-Young Yun
Abstract
Long contexts have become standard in pretrained LLMs, yet they remain expensive to run: prefill compute grows quadratically with sequence length, and every decode step re-reads a key-value cache that grows linearly with it. Sparse attention cuts these costs by attending only to a relevant subset of past tokens, but selecting that subset is itself expensive. We present SpotAttention, a lightweight selector that attaches to a frozen pretrained transformer and learns by KL distillation to estimate its attention distribution. The selector picks the top-K keys each query attends to, and because its estimate is a calibrated distribution, a dual top-p rule reads the per-query, per-layer budget directly from it. Across Qwen3 (dense, 4B-32B) and Qwen3.5 (hybrid linear/full attention, 4B-9B), SpotAttention matches dense accuracy at contexts up to 128K tokens, eight times the training length. Decode at L=128K runs 3.9x faster than FlashAttention and 1.8x faster than Twilight, the strongest training-free baseline. Quantizing the selector's K-cache to INT4 or FP4 microscale shrinks it 3.5x at no accuracy cost.
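To make the budget idea concrete, the following is a minimal NumPy sketch of reading a per-query key budget from an estimated attention distribution. It is an illustration under stated assumptions, not the authors' implementation: it uses a plain top-p (nucleus) cutoff with an optional top-K cap rather than the paper's dual top-p rule, works on individual keys rather than blocks, and the names `top_p_budget`, `sparse_attention`, `p`, and `k_max` are hypothetical.

```python
import numpy as np


def top_p_budget(probs, p=0.95, k_max=None):
    """Pick the smallest set of keys whose estimated attention mass reaches p.

    probs : 1-D array, the selector's estimated attention distribution for
            one query (assumed non-negative and summing to about 1).
    p     : target cumulative mass (hypothetical parameter).
    k_max : optional hard cap on the number of keys (hypothetical parameter).
    Returns key indices ordered from highest to lowest estimated probability.
    """
    probs = np.asarray(probs, dtype=np.float64)
    order = np.argsort(-probs, kind="stable")  # highest probability first
    cum = np.cumsum(probs[order])
    # First position where the cumulative mass reaches p, converted to a count.
    budget = int(np.searchsorted(cum, p, side="left")) + 1
    budget = min(budget, probs.size)
    if k_max is not None:
        budget = min(budget, k_max)
    return order[:budget]


def sparse_attention(q, K, V, keep):
    """Softmax attention for one query, restricted to the keys in `keep`."""
    scores = K[keep] @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())  # subtract max for stability
    weights /= weights.sum()
    return weights @ V[keep]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 16, 8
    q = rng.standard_normal(d)
    K = rng.standard_normal((n, d))
    V = rng.standard_normal((n, d))

    # Stand-in for the selector's estimate: here, the exact attention
    # distribution. A learned selector would only approximate it.
    logits = K @ q / np.sqrt(d)
    est = np.exp(logits - logits.max())
    est /= est.sum()

    keep = top_p_budget(est, p=0.9, k_max=8)
    out = sparse_attention(q, K, V, keep)
    print("keys kept:", keep.size, "of", n)
    print("output shape:", out.shape)
```

Because the cutoff is read from the distribution itself, a sharply peaked query keeps very few keys while a diffuse one keeps more, which is the sense in which a calibrated estimate yields a per-query budget.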