SparseX: Efficient Segment-Level KV Cache Sharing for Interleaved LLM Serving

2026-06-01

Performance
AI summary

The authors address a problem in large language models where reusing parts of previous text (cached prefixes) speeds up response time, but real-world text reuse is often more complicated than just prefix reuse. They introduce SparseX, a method that breaks text into segments and selectively updates key parts of the model’s memory to handle complex, interleaved reuse patterns efficiently. SparseX combines full attention in early model layers with sparse recomputation in later layers to maintain accuracy while improving speed. Their system works without extra training and supports common applications like chat and retrieval-based tasks.

Long-context LLM, Prefill stage, KV Cache, Prefix Cache, Sparse attention, PagedAttention, RoPE alignment, FlashAttention, Retrieval-augmented generation (RAG), Multi-round chat
Authors
Quqing Zhang, Kai Chen, Ning Liao, Zehao Lin, Bo Tang, Feiyu Xiong, Xiaoxing Wang
Abstract
In long-context LLM serving, the prefill stage often dominates time-to-first-token and computational cost. Although Prefix Cache in vLLM/PagedAttention has been widely used to reuse identical prompt prefixes, repeated content in practical applications frequently appears as non-prefix, cross-request, cross-turn, and cross-agent segments, which makes conventional cache mechanisms insufficient. This paper presents SparseX, a segment-level KV Cache sharing method for common serving scenarios. SparseX uses contiguous token segments as reuse units and exploits Sparse-Q indices that naturally arise in KV Cache reuse workloads to estimate the key tokens that require correction. Based on this estimate, SparseX performs Sparse-KV Recomputation within a single forward pass, thereby restoring cross-segment contextual interactions under complex interleaved reuse patterns while avoiding additional models or separate preprocessing stages for token selection. SparseX further implements a full+sparse hybrid attention mode based on a layer-specific threshold: early layers retain full attention to obtain a more stable token-importance signal, and later layers switch to sparse recomputation to improve reuse quality on complex long-context tasks. We implement SparseX-vLLM on top of vLLM, integrating segment-level cache lookup, PagedAttention management, RoPE alignment, Sparse-Q token selection, and FlashAttention backends into a unified execution path. SparseX is model-agnostic, training-free, and compatible with Prefix Cache, and it provides unified support for common online serving scenarios including multi-round chat, retrieval-augmented generation (RAG), and agent workflows.
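
The difference between prefix reuse and segment-level reuse described in the abstract can be illustrated with a small sketch. This is not the SparseX-vLLM implementation: the names `SegmentCache`, `segment_key`, `split_segments`, `prefix_hit_tokens`, and the fixed `SEGMENT_LEN` are hypothetical, segments are cut at fixed aligned offsets for simplicity, and the cache stores token tuples as stand-ins for KV blocks. It only shows how keying contiguous segments by content, rather than by position in the prompt, lets repeated content be found when it is not a prefix. RoPE alignment and Sparse-KV Recomputation, which the abstract says are needed to restore cross-segment interactions, are not modeled here.

```python
import hashlib

SEGMENT_LEN = 4  # hypothetical segment size in tokens (assumption, not from the paper)


def segment_key(segment):
    """Hash a segment by content only, so the key does not depend on position."""
    data = ",".join(str(t) for t in segment).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def split_segments(token_ids, seg_len=SEGMENT_LEN):
    """Cut a token list into contiguous (start, tokens) chunks of seg_len."""
    return [(i, tuple(token_ids[i:i + seg_len]))
            for i in range(0, len(token_ids), seg_len)]


def prefix_hit_tokens(cached, query):
    """Number of leading tokens a pure prefix cache could reuse."""
    n = 0
    for a, b in zip(cached, query):
        if a != b:
            break
        n += 1
    return n


class SegmentCache:
    """Toy content-addressed cache; values stand in for per-segment KV blocks."""

    def __init__(self):
        self._store = {}

    def insert(self, token_ids):
        # Only full-length segments are stored in this sketch.
        for _, seg in split_segments(token_ids):
            if len(seg) == SEGMENT_LEN:
                self._store[segment_key(seg)] = seg

    def plan(self, token_ids):
        """Return (start, end, hit) for each segment of a new request."""
        return [(start, start + len(seg), segment_key(seg) in self._store)
                for start, seg in split_segments(token_ids)]


first = list(range(1, 13))                               # segments S1, S2, S3
second = [100, 101, 102, 103, 9, 10, 11, 12, 1, 2, 3, 4]  # new segment, then S3, S1

cache = SegmentCache()
cache.insert(first)
print(prefix_hit_tokens(first, second))  # 0
print(cache.plan(second))  # [(0, 4, False), (4, 8, True), (8, 12, True)]
```

In this toy case a prefix cache reuses nothing because the second request starts with new tokens, while the segment lookup finds 8 of the 12 tokens already cached. In a real serving system the reused entries would have been computed at different positions and without the new surrounding context, which is the gap that position alignment and selective recomputation are meant to close.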