IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference
2026-05-25 • Computation and Language
Computation and Language, Artificial Intelligence
AI summary
The authors address a bottleneck in long-context inference: the key-value (KV) cache that a large language model keeps during generation grows linearly with input length. Instead of relying on hand-written heuristics, they train an indexer that predicts which cached tokens are important enough to keep. To avoid permanently losing the information in evicted tokens, they add a small latent memory module that compresses those tokens into a compact state. This lets models handle long inputs under a fixed memory budget, with better benchmark scores and more stable retrieval.
Large Language Models, softmax attention, KV cache, eviction policies, latent memory, token importance, long-context inference, compression, retrieval, bounded memory
Authors
Xintong Yang, Hao Gu, Binxing Xu, Lujun Li, Bei Liu, Jiacheng Liu, Qiyuan Zhu, Sirui Han, Yike Guo
Abstract
Large Language Models (LLMs) are increasingly expected to operate over long contexts, yet standard softmax attention incurs a KV cache that grows linearly with sequence length, quickly becoming the bottleneck for long-context inference. A practical remedy is to evict less important KV entries; however, existing eviction policies are largely heuristic and struggle to capture the rich, input-dependent distribution of token importance. In this work, we introduce a learnable indexer that predicts KV importance, enabling more accurate retention of critical tokens. Meanwhile, naively evicting tokens permanently discards their information, leading to irreversible forgetting and degraded retrieval over long ranges. To address this, we propose a lightweight latent memory module that compresses evicted tokens into a compact, online-updated state and provides residual readouts to compensate for the attention contributions lost through KV eviction. Collectively, our method enables accurate long-context inference under a bounded KV budget, delivering consistent improvements on RULER (4K/16K) across Qwen, Mistral, and Llama models (up to 25 points under aggressive eviction), markedly more stable Needle-in-a-Haystack retrieval, and superior LongBench scores and compression curves compared to existing eviction policies.
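
The abstract does not give the form of the indexer or the memory update, so the following is only an illustrative sketch of the general pattern it describes: keep the highest-scoring KV entries within a budget, fold the evicted ones into a fixed-size state, and add a readout from that state to the attention output. Everything here is an assumption rather than the authors' method: the importance scores are passed in by hand instead of coming from a learned indexer, the memory is a plain running sum of key-value outer products (a linear-attention-style state), and the names `topk_keep`, `LatentMemory`, and `evict_and_attend` are hypothetical.

```python
import math


def topk_keep(scores, budget):
    """Return the positions of the `budget` highest-scoring entries, in order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:budget])


class LatentMemory:
    """Assumed fixed-size state: a running sum of outer products key * value^T."""

    def __init__(self, d_k, d_v):
        self.state = [[0.0] * d_v for _ in range(d_k)]

    def write(self, key, value):
        # Online update: fold one evicted (key, value) pair into the state.
        for i, k_i in enumerate(key):
            for j, v_j in enumerate(value):
                self.state[i][j] += k_i * v_j

    def read(self, query):
        # Residual readout: query^T @ state, a vector of length d_v.
        d_v = len(self.state[0])
        return [
            sum(query[i] * self.state[i][j] for i in range(len(query)))
            for j in range(d_v)
        ]


def softmax_attention(query, keys, values):
    """Standard scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    d_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(d_v)]


def evict_and_attend(query, keys, values, scores, budget, memory):
    """Keep `budget` entries, write the rest to memory, attend, add the readout."""
    keep = topk_keep(scores, budget)
    kept = set(keep)
    for i in range(len(keys)):
        if i not in kept:
            memory.write(keys[i], values[i])
    attn = softmax_attention(query, [keys[i] for i in keep], [values[i] for i in keep])
    residual = memory.read(query)
    return keep, [a + r for a, r in zip(attn, residual)]


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    scores = [0.9, 0.1, 0.5]  # hand-picked stand-ins for learned importance scores
    memory = LatentMemory(d_k=2, d_v=2)
    keep, output = evict_and_attend([0.0, 1.0], keys, values, scores, 2, memory)
    print(keep, output)
```

In this toy run the entry at position 1 has the lowest score, so it is evicted and written to the memory state, while positions 0 and 2 stay in the cache. A real system would normalize or gate the readout and learn both the scoring and the memory update; the unweighted sum used here is only the simplest choice that shows the data flow.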