MomentKV: Closing the Directional Gap in KV Cache Eviction for Long-Context Inference

2026-06-01
Machine Learning

AI summary

The authors studied how Transformer language models store past context during text generation and found that discarding part of that stored context causes large errors, because the discarded and retained entries point in very different directions. They created MomentKV, a method that keeps simple summary statistics of the discarded entries so that their contribution can be estimated and corrected for. This approach helps the model make better predictions even when it has limited memory for past information. In tests on LongBench and RULER, MomentKV outperformed the compared baselines at every cache budget, especially when memory is very tight.

Transformer, autoregressive decoding, KV cache, long-context inference, cache eviction, attention mechanism, moment statistics, LLaMA, Qwen, LongBench
Authors
Yu Li, Binxu Li, Tian Lan
Abstract
Autoregressive decoding in Transformer-based language models relies on the KV cache, whose memory footprint grows linearly with sequence length and becomes the primary bottleneck for long-context inference. KV cache eviction addresses this by retaining a fixed-size subset of key-value pairs and discarding the rest. We find that a primary source of output degradation is not the residual attention mass on evicted tokens, which existing methods already minimize, but a directional mismatch between the retained and evicted token sets. Specifically, the evicted tokens are in practice often near-orthogonal to the retained ones, so even a small evicted mass can have an outsized effect on the direction of the resulting output and be amplified into substantial output error. This reveals a fundamental limit of existing strategies. To address this, we propose MomentKV, which maintains compact moment statistics over the evicted token set, including a count, a key mean, a value mean, and a value-key covariance. During eviction, these statistics are used to identify tokens that are already well aligned with, and captured by, the accumulated summary, which keeps the evicted set geometrically regular. During inference, the same statistics yield a closed-form first-order approximation of the evicted attention output, forming a mutually reinforcing loop between selective eviction and accurate correction. On LongBench and RULER with LLaMA-3.1-8B-Instruct and Qwen3-4B-Instruct, MomentKV outperforms all baselines at every cache budget, with the largest gains under aggressive compression.
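To make the role of the four moment statistics concrete, the following NumPy sketch shows one way a count, key mean, value mean, and value-key covariance could produce a first-order correction for evicted tokens. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the formula, so the expansion used here (linearizing exp(q·k/√d) around the evicted key mean) is only one natural reading. The names `EvictedMoments`, `exact_attention`, and `corrected_attention` are hypothetical, and the eviction-time selection rule is not sketched because the abstract does not specify it.

```python
import numpy as np


class EvictedMoments:
    """Running sums over evicted (key, value) pairs (hypothetical helper)."""

    def __init__(self, d_k, d_v):
        self.n = 0
        self.sum_k = np.zeros(d_k)
        self.sum_v = np.zeros(d_v)
        self.sum_vk = np.zeros((d_v, d_k))  # running sum of outer(v, k)

    def add(self, k, v):
        """Fold one evicted key/value pair into the summary."""
        self.n += 1
        self.sum_k += k
        self.sum_v += v
        self.sum_vk += np.outer(v, k)

    def stats(self):
        """Return count, key mean, value mean, and value-key covariance."""
        mu_k = self.sum_k / self.n
        mu_v = self.sum_v / self.n
        cov_vk = self.sum_vk / self.n - np.outer(mu_v, mu_k)
        return self.n, mu_k, mu_v, cov_vk


def exact_attention(q, K, V):
    """Single-query softmax attention over all rows of K and V."""
    s = K @ q / np.sqrt(q.shape[0])
    w = np.exp(s - s.max())
    return (w @ V) / w.sum()


def corrected_attention(q, K_keep, V_keep, moments):
    """Attention over retained tokens plus a first-order evicted-set term.

    Assumed approximation (not taken from the paper):
      sum_i exp(s_i)      ~= n * exp(s_mu)
      sum_i exp(s_i) v_i  ~= n * exp(s_mu) * (mu_v + scale * C @ q)
    where s_mu = scale * q . mu_k and C is the value-key covariance.
    Assumes at least one retained token and one evicted token.
    """
    scale = 1.0 / np.sqrt(q.shape[0])
    n, mu_k, mu_v, cov_vk = moments.stats()

    s_keep = K_keep @ q * scale
    s_mu = mu_k @ q * scale
    m = max(s_keep.max(), s_mu)  # shared shift for numerical stability

    w_keep = np.exp(s_keep - m)
    z_evicted = n * np.exp(s_mu - m)  # approximate evicted attention mass

    numerator = w_keep @ V_keep + z_evicted * (mu_v + cov_vk @ q * scale)
    denominator = w_keep.sum() + z_evicted
    return numerator / denominator
```

In this sketch, each evicted pair is passed to `EvictedMoments.add`, and `corrected_attention` then combines the retained keys and values with the summary at decode time. The summary takes O(d_k · d_v) memory regardless of how many tokens are evicted. Because the correction is first order, it is most accurate when the evicted keys are tightly clustered around their mean, which is consistent with the abstract's emphasis on keeping the evicted set geometrically regular.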