From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs
2026-06-08 • Artificial Intelligence
Artificial Intelligence, Computation and Language
AI summary
The authors found that different parts of a language model's attention system behave very differently when reading long texts: some stay sharply focused throughout, while others change a lot from one part of the text to the next. They created EntropyInfer, a method that watches these behaviors in real time to decide where to spend computing power during text processing. The method also compresses stored information based on the words the model has generated so far. Their tests showed that EntropyInfer speeds up long-text processing with only small drops in quality compared to full attention.
Sparse Attention, KV Cache Compression, Attention Heads, Entropy, Prefill, Decoding, Language Models, Llama, Qwen, Cache Entries
Authors
Zhanchao Xu, Haoyang Li, Qingfa Xiao, Fei Teng, Chen Jason Zhang, Lei Chen, Qing Li
Abstract
Existing sparse attention and KV cache compression methods for long-context LLM inference typically apply fixed sparsity patterns or uniform budgets across all attention heads, overlooking the substantial variation in attention behavior among heads and contexts. We observe two distinct entropy patterns among attention heads: Rigid Heads, whose entropy stays near zero across input segments, and Dynamic Heads, whose entropy fluctuates significantly. Crucially, the distribution of these types is context-dependent and cannot be predetermined offline. We therefore propose EntropyInfer, a training-free framework that uses attention entropy to adaptively allocate compute at the granularity of individual heads and segments during prefilling. For decoding, we introduce a latent KV cache compression scheme that leverages generated output tokens, rather than prefill tokens alone, to identify and retain the most critical cache entries. Extensive experiments on the Llama, Qwen, and openPangu model series show that EntropyInfer consistently outperforms baselines including SnapKV, AdaKV, and CritiPrefill, achieving up to 2.39$\times$ end-to-end speedup beyond 100k tokens with minimal quality degradation compared to full attention. The code is available at https://github.com/SHA-4096/EntropyInfer.
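
To make the Rigid/Dynamic distinction concrete, the following is a minimal sketch of how per-segment attention entropy could be computed and used to label a head. It is an illustration under stated assumptions, not the EntropyInfer implementation: the helper names, the toy scores, and the `rigid_threshold` value of 0.1 are all hypothetical, and the abstract does not specify the actual classification rule.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(probs):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def classify_head(segment_entropies, rigid_threshold=0.1):
    """Hypothetical rule: 'rigid' if entropy stays below the threshold in every segment."""
    return "rigid" if max(segment_entropies) < rigid_threshold else "dynamic"


# Toy attention scores (assumed data): one list of key scores per input segment, per head.
head_scores = {
    "head_0": [[20.0, 0.0, 0.0, 0.0], [0.0, 20.0, 0.0, 0.0]],  # sharply peaked in both segments
    "head_1": [[20.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],   # peaked, then uniform
}

# Entropy of each head in each segment, then one label per head.
head_entropies = {
    name: [attention_entropy(softmax(seg)) for seg in segments]
    for name, segments in head_scores.items()
}
head_labels = {name: classify_head(ents) for name, ents in head_entropies.items()}
```

Because the labels are derived from the scores observed for the current input, they can differ from one context to another, which is consistent with the abstract's point that the head types cannot be fixed offline.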