Kamera: Unified Position-Invariant Multimodal KV Cache for Training-Free Reuse

2026-06-22 | Distributed, Parallel, and Cluster Computing

Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Computer Vision and Pattern Recognition
AI summary

The authors explain that multimodal agents, which process videos and images multiple times, usually re-encode data from scratch each time they revisit the same content, causing inefficiency. They find that simply reusing cached information misses subtle, important connections between chunks of data, harming complex reasoning while simple recall stays fine. To fix this, the authors introduce a small, training-free patch that preserves these cross-chunk relationships, allowing faster, more accurate multi-step reasoning without full re-encoding. Their method works well across various tasks and reduces memory and computation needs, especially benefiting agents dealing with visual and video inputs.

multimodal agents, context window, KV cache, cross-chunk conditioning, multi-hop reasoning, RoPE re-rotation, low-rank patch, sliding window, video processing, memory footprint
Authors
Bole Ma, Jan Eitzinger, Harald Koestler, Gerhard Wellein
Abstract
Multimodal agents repeatedly re-examine the same video frames, UI screenshots, and rendered artifacts as their context window slides and their reasoning iterates, yet every look-back re-encodes from scratch, because prefix caches serve reuse only at a fixed leading position. We show that this recompute is avoidable, and we identify exactly what naive KV reuse loses: the cross-chunk conditioning that a chunk absorbs from its neighbours. This loss is asymmetric. The direct readout of a cached chunk is recovered exactly, and for free, by the standard state-merge. What remains is a diffuse, low-rank residue concentrated in deep layers, invisible to single-hop retrieval but precisely what multi-hop reasoning relies on. Blind reuse therefore leaves single-hop recall intact while halving multi-hop accuracy; this is the failure mode that prior position-independent caches, designed for single-context or single-image reuse, do not address. We repair it with a small, training-free, low-rank conditioning patch stored alongside each position-free chunk. Reuse reduces to one operator across MLA, GQA, and MHA: exact RoPE re-rotation to any target position, plus the patch that restores cross-chunk binding. This makes three window operations cheap: reorder (one patch serves every ordering of a cached set), sliding-window survival (surviving chunks relocate via rotation only, with no re-encoding), and recall (an evicted chunk is rehydrated by its patch and never re-encoded). A rank-m patch recovers full task accuracy on cross-chunk-binding benchmarks (MM-NIAH across two attention families, and two-page doc-QA) at a fraction of the KV footprint, and it reconstructs re-prefill KV to within bf16 rounding in a production SGLang kernel across six backbones. The conditioning signal is strongest in redundant vision and video streams, making our solution most impactful where multimodal agents spend their recompute budget.
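The two exact building blocks named in the abstract, RoPE re-rotation and the state-merge, can be illustrated with a minimal NumPy sketch. This is not the authors' kernel. It assumes the interleaved-pair RoPE convention with base 10000, and it assumes that "state-merge" refers to the common log-sum-exp merge of partial attention states. The function names (`rope_rotate`, `partial_attention`, `merge_states`) are hypothetical, and the low-rank conditioning patch is not modelled because the abstract does not specify how it is built.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply RoPE at position(s) `pos` to x (assumed interleaved-pair layout)."""
    dim = x.shape[-1]
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)           # one frequency per channel pair
    theta = np.asarray(pos, dtype=np.float64)[..., None] * inv_freq
    cos, sin = np.cos(theta), np.sin(theta)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def partial_attention(q, K, V):
    """Single-query attention over one chunk; returns (output, log-sum-exp of scores)."""
    scores = K @ q / np.sqrt(q.shape[-1])
    m = scores.max()
    w = np.exp(scores - m)
    return (w @ V) / w.sum(), m + np.log(w.sum())

def merge_states(o1, lse1, o2, lse2):
    """Merge two partial attention states (assumed meaning of 'state-merge')."""
    m = max(lse1, lse2)
    w1, w2 = np.exp(lse1 - m), np.exp(lse2 - m)
    return (w1 * o1 + w2 * o2) / (w1 + w2)

rng = np.random.default_rng(0)
d = 8

# Re-rotation: keys cached at old positions are moved by rotating with the offset.
k_raw = rng.standard_normal((4, d))
old_pos = np.arange(4)
new_pos = old_pos + 100
k_cached = rope_rotate(k_raw, old_pos)
k_moved = rope_rotate(k_cached, new_pos - old_pos)   # relocate without re-encoding
k_fresh = rope_rotate(k_raw, new_pos)                # what re-encoding would produce

# State-merge: attention over two chunks separately, then merged.
q = rng.standard_normal(d)
K1, V1 = rng.standard_normal((5, d)), rng.standard_normal((5, d))
K2, V2 = rng.standard_normal((3, d)), rng.standard_normal((3, d))
o1, lse1 = partial_attention(q, K1, V1)
o2, lse2 = partial_attention(q, K2, V2)
o_merged = merge_states(o1, lse1, o2, lse2)
o_full, _ = partial_attention(q, np.vstack([K1, K2]), np.vstack([V1, V2]))
```

Because planar rotations compose additively, `k_moved` matches `k_fresh` up to floating-point rounding, and `o_merged` matches `o_full` for the same reason the softmax denominator can be accumulated chunk by chunk. Neither step accounts for how a chunk's own keys and values were conditioned on its neighbours during encoding, which is the residue the abstract attributes to the low-rank patch.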