Make Your LVLM KV Cache More Lightweight

2026-05-01

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
AI summary

The authors found that a key component of large vision-language AI models, called the Key-Value (KV) cache, uses a lot of GPU memory because it stores many vision tokens. They created a new method named LightKV that shrinks this cache by combining similar vision-token information, guided by the text prompts the model receives. This approach lets the model store fewer tokens without losing important details. Tests on eight models and eight benchmark datasets showed that LightKV can cut the vision-token cache size in half, reduce computation by up to 40%, and still keep performance strong compared to older methods.

Key-Value cache, Large Vision-Language Models, vision tokens, GPU memory, cross-modality message passing, token compression, prefill stage, prompt-guided aggregation, model inference efficiency
Authors
Xihao Chen, Yangyang Guo, Roger Zimmermann
Abstract
The Key-Value (KV) cache has become a de facto component of modern Large Vision-Language Models (LVLMs) for inference. While it enhances decoding efficiency in Large Language Models (LLMs), its direct adoption in LVLMs introduces substantial GPU memory overhead due to the large number of vision tokens processed during the prefill stage. To tackle this problem, we propose LightKV, a novel approach that reduces KV cache size by exploiting the redundancy among vision-token embeddings. Guided by text prompts, LightKV employs cross-modality message passing to aggregate informative messages across vision tokens and progressively compress them during prefill. This prompt-aware guidance distinguishes our method from prior vision-only compression strategies. We evaluate LightKV on eight open-source LVLMs across eight public benchmark datasets, including MME and SeedBench. Experimental results demonstrate that with only 55% of the original vision tokens, LightKV (a) halves the vision-token KV cache size, (b) reduces computation by up to 40%, and (c) preserves general-purpose performance while significantly outperforming existing baselines.
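To make the idea of prompt-guided vision-token KV compression concrete, below is a minimal PyTorch sketch. It is not the LightKV algorithm itself: the scoring rule (attention mass from text-prompt queries), the 55% keep ratio, the single nearest-neighbor merging step, and all function and variable names (`compress_vision_kv`, `vision_keys`, `text_queries`, etc.) are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of prompt-guided compression of vision-token KV pairs.
# NOT the authors' LightKV method; scoring and merging rules are assumptions.
import torch
import torch.nn.functional as F


def compress_vision_kv(vision_keys, vision_values, text_queries, keep_ratio=0.55):
    """Retain prompt-relevant vision-token KV pairs and merge the rest into them.

    vision_keys, vision_values: (num_vision_tokens, head_dim)
    text_queries:               (num_text_tokens, head_dim)
    keep_ratio:                 fraction of vision tokens kept (e.g. 0.55).
    """
    num_tokens = vision_keys.size(0)
    num_keep = max(1, int(num_tokens * keep_ratio))

    # Score each vision token by the total attention weight it receives
    # from the text-prompt queries (the "prompt-guided" part).
    attn = torch.softmax(
        text_queries @ vision_keys.T / vision_keys.size(-1) ** 0.5, dim=-1
    )
    scores = attn.sum(dim=0)  # (num_vision_tokens,)

    keep_idx = scores.topk(num_keep).indices
    drop_mask = torch.ones(num_tokens, dtype=torch.bool)
    drop_mask[keep_idx] = False
    drop_idx = drop_mask.nonzero(as_tuple=True)[0]

    kept_k = vision_keys[keep_idx].clone()
    kept_v = vision_values[keep_idx].clone()

    if drop_idx.numel() > 0:
        # Merge each dropped token into its most similar kept token by averaging;
        # a single nearest-neighbor step standing in for richer message passing.
        sim = F.normalize(vision_keys[drop_idx], dim=-1) @ F.normalize(kept_k, dim=-1).T
        nearest = sim.argmax(dim=-1)
        counts = torch.ones(num_keep)
        for d, n in zip(drop_idx.tolist(), nearest.tolist()):
            kept_k[n] += vision_keys[d]
            kept_v[n] += vision_values[d]
            counts[n] += 1
        kept_k /= counts.unsqueeze(-1)
        kept_v /= counts.unsqueeze(-1)

    return kept_k, kept_v


if __name__ == "__main__":
    # Example: 576 vision tokens (a 24x24 patch grid) reduced to ~55%.
    k, v = torch.randn(576, 64), torch.randn(576, 64)
    q_text = torch.randn(32, 64)
    ck, cv = compress_vision_kv(k, v, q_text)
    print(ck.shape, cv.shape)  # torch.Size([316, 64]) torch.Size([316, 64])
```

In such a sketch, storing roughly 55% of the vision-token KV pairs is what would roughly halve the vision-token portion of the cache; the paper's reported compute savings come from its own progressive compression during prefill rather than from this simplified one-shot merge.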