TokenPilot: Cache-Efficient Context Management for LLM Agents

2026-06-15 | Computation and Language

Computation and Language, Artificial Intelligence, Machine Learning, Multiagent Systems
AI summary

The authors address the rising inference cost of long LLM agent sessions as context accumulates. They propose TokenPilot, a system that manages which parts of the context to keep or remove without disturbing the prompt layout, so that previously cached prompt prefixes can still be reused. It works at two granularities: globally, it filters out irrelevant environmental noise when content is first ingested; locally, it evicts context segments only once they are no longer relevant to the task. Experiments show that TokenPilot reduces costs by 56% to 87% while keeping performance competitive with prior systems. TokenPilot has been integrated into the open-source LightMem2 project.

large language models, context management, inference cost, prompt caching, token pruning, memory eviction, context compaction, task relevance, long-horizon sessions, LightMem2
Authors
Buqiang Xu, Zirui Xue, Dianmou Chen, Chenyang Fu, Chiyu Wu, Caiying Huang, Chen Jiang, Jizhan Fang, Xinle Deng, Yijun Chen, Yunzhi Yao, Xuehai Wang, Jin Shang, Gong Yu, Ningyu Zhang
Abstract
As LLM agents are deployed in long-horizon sessions, context accumulation drives up inference costs. Existing approaches utilize text pruning or dynamic memory eviction to minimize token footprints; however, their unconstrained sequence mutations alter layouts, introducing prefix mismatches and cache invalidation. This reveals a critical trade-off between text sparsity and prompt cache continuity. To address this, we present TokenPilot, a dual-granularity context management framework. Globally, Ingestion-Aware Compaction acts as a framework harness to stabilize prompt prefixes and eliminate open-world environmental noise at the ingestion gate. Locally, Lifecycle-Aware Eviction monitors the ongoing residual utility of context segments, enforcing a conservative batch-turn schedule to offload content segments only when task relevance expires. Experiments on PinchBench and Claw-Eval under both isolated and continuous modes demonstrate that TokenPilot reduces costs by 61% and 56% in isolated mode, and 61% and 87% in continuous mode, while maintaining competitive performance compared to prior systems. TokenPilot has been integrated into LightMem2 at https://github.com/zjunlp/LightMem2.
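The trade-off between text sparsity and prompt cache continuity described in the abstract can be illustrated with a toy simulation. The sketch below is not TokenPilot's algorithm. It assumes a prefix-matching cache in which only the leading segments shared with the previous request are reusable, treats every segment as one unit of cost, and swaps in a simple sliding-window eviction in place of the relevance-based eviction the paper describes. The names `shared_prefix_len`, `run_session`, `evict_every`, and `keep_last` are hypothetical.

```python
from typing import List, Tuple


def shared_prefix_len(prev: List[str], curr: List[str]) -> int:
    """Count the leading segments two prompts have in common.

    Under a prefix-matching cache (an assumption of this sketch),
    only these segments can be served from cache.
    """
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n


def run_session(n_turns: int, evict_every: int, keep_last: int = 2) -> Tuple[int, int]:
    """Simulate a session and return (segments_sent, segments_recomputed).

    evict_every = 0 means never evict (append-only context).
    Otherwise, every `evict_every` turns the context is cut down to the
    system segment plus the `keep_last` most recent turn segments.
    """
    context: List[str] = ["sys"]
    prev: List[str] = []
    sent = 0
    recomputed = 0
    for i in range(1, n_turns + 1):
        context.append(f"turn{i}")
        if evict_every and i % evict_every == 0:
            # Dropping segments from the middle shifts everything after
            # them, so the cached prefix ends at the first removed segment.
            context = ["sys"] + context[1:][-keep_last:]
        sent += len(context)
        recomputed += len(context) - shared_prefix_len(prev, context)
        prev = list(context)
    return sent, recomputed


if __name__ == "__main__":
    for label, every in [("append-only", 0), ("evict every turn", 1), ("evict every 3 turns", 3)]:
        sent, recomputed = run_session(n_turns=6, evict_every=every)
        print(f"{label:>20}: sent={sent:2d} recomputed={recomputed:2d}")
```

In this toy setting, evicting on every turn sends the fewest segments but recomputes the most, while append-only does the opposite; evicting in batches falls between the two on both counts. That is the intuition behind a conservative batch-turn schedule, though the actual gains depend on real token counts and on how a provider prices cached versus uncached tokens.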