CacheWise: Understanding Workloads and Optimizing KVCache Management for Efficiently Serving LLM Coding Agents

2026-06-15 Distributed, Parallel, and Cluster Computing

Distributed, Parallel, and Cluster Computing; Operating Systems
AI summary

The authors studied how AI coding helpers behave when they write code step-by-step and call external tools. They found that these coding sessions keep reusing large amounts of previously computed information, which puts sustained pressure on the memory (the KVCache) that current serving systems manage poorly. To fix this, the authors created CacheWise, a smarter memory manager that decides which cached information to keep and which requests to run next. When tested on real traces, CacheWise made coding sessions finish faster and discarded less cached data.

LLM, coding agents, KVCache, prefix reuse, tool calls, memory management, scheduling, eviction, vLLM
Authors
Shubham Tiwari, Tapan Chugh, Nash Rickert, Simon Peter, Ratul Mahajan, Haiying Shen
Abstract
Coding agents are a fast-growing LLM application, executing as long-running closed-loop sessions in which LLM generations alternate with external tool calls. Yet, unlike chat workloads, their serving behavior has not been studied extensively. We address this gap by collecting a dataset of real-world coding assistant traces. Our analysis shows that coding agent sessions repeatedly reuse large prefixes and create sustained KVCache pressure that conventional LLM serving policies handle poorly. Based on our analysis, we present CacheWise, a KVCache management layer that improves KVCache reuse for coding agent workloads. CacheWise combines prefix-aware scheduling with reuse-aware eviction guided by lightweight predictions from tool call metadata. Implemented in vLLM and evaluated on the collected traces, CacheWise reduces KVCache evictions by up to 2-2.6x and improves total agent session completion time by up to 3.5x.
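To make the two ideas named in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation and not vLLM's API. It assumes a toy model in which each paused session holds some KVCache blocks while it waits on a tool call, and in which the tool's name gives a rough estimate of how long the wait will last. All names (`SessionEntry`, `EXPECTED_TOOL_SECONDS`, `choose_victims`, `pick_next`) and all duration values are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical expected tool-call durations in seconds (illustrative values,
# not taken from the paper's traces).
EXPECTED_TOOL_SECONDS = {"read_file": 0.1, "web_search": 5.0, "run_tests": 30.0}


@dataclass
class SessionEntry:
    session_id: str
    blocks: int       # KVCache blocks held by this session's cached prefix
    paused_at: float  # time at which the session began waiting on a tool
    tool: str         # name of the tool the session is waiting on


def predicted_resume_time(entry: SessionEntry) -> float:
    """Estimate when the session will need its cached prefix again."""
    # Unknown tools fall back to an assumed 1-second wait.
    return entry.paused_at + EXPECTED_TOOL_SECONDS.get(entry.tool, 1.0)


def choose_victims(entries: list[SessionEntry], blocks_needed: int) -> list[str]:
    """Reuse-aware eviction: evict sessions predicted to resume latest first."""
    victims: list[str] = []
    freed = 0
    for entry in sorted(entries, key=predicted_resume_time, reverse=True):
        if freed >= blocks_needed:
            break
        victims.append(entry.session_id)
        freed += entry.blocks
    return victims


def pick_next(waiting: list[str], cached_prefix_tokens: dict[str, int]) -> str:
    """Prefix-aware scheduling: run the request with the most cached prefix."""
    return max(waiting, key=lambda sid: cached_prefix_tokens.get(sid, 0))


entries = [
    SessionEntry("a", blocks=10, paused_at=0.0, tool="read_file"),
    SessionEntry("b", blocks=10, paused_at=0.0, tool="run_tests"),
    SessionEntry("c", blocks=10, paused_at=1.0, tool="web_search"),
]
victims = choose_victims(entries, blocks_needed=15)
next_session = pick_next(["a", "b"], {"a": 100, "b": 4000})
```

In this toy setup, the session waiting on the long-running test command is evicted before the one waiting on a quick file read, whereas a plain least-recently-used policy would treat sessions paused at the same time identically. The scheduling helper prefers the request whose prefix is already largely cached, so that cached blocks are used before they can be evicted.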