What Should a Streaming Video Model Remember?

2026-06-15

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors study models that must understand video in real time under limited memory and compute. They propose SelectStream, a system that selectively decides which past information to keep and when to use it, rather than storing everything. The current observation stays fully visible to the model, while relevant past details are summarized compactly to help answer questions about the video. In the reported experiments, SelectStream reaches 82.67% on StreamingBench and 67.03% on OVO-Bench, outperforming recent-window baselines and prior streaming memory methods.

streaming video understanding, latent memory, video language models (VLM), adaptive windowing, evidence allocation, memory consolidation, query-conditioned reasoning, real-time inference, compact representation, video benchmarks
Authors
Haonan Ge, Yiwei Wang, Hang Wu, Yujun Cai
Abstract
Streaming video understanding models must answer queries at any moment during an ongoing stream, using only what they have observed so far and under fixed memory and computation budgets. Existing methods address this by adding memory banks, retrieval modules, or visual token compression to preserve long-range history. However, strong recent-window baselines show that indiscriminate history injection can dilute current-scene perception, suggesting that the key challenge is not whether to use memory, but how to allocate it selectively. We formulate this as budgeted online latent evidence allocation and propose SelectStream, a selective latent-memory framework that keeps the current observation directly visible to a frozen VLM while exposing historical information only through a compact, query-conditioned evidence budget. Three coordinated mechanisms govern when to write, what to preserve, and how to retrieve: surprise-driven adaptive windowing, priority-preserving consolidation, and query-conditioned graph reasoning over a fixed-capacity latent memory graph. Retrieved evidence is calibrated and injected as latent tokens for answer generation, without replaying frames or growing the context with stream length. Experimental results show that SelectStream achieves strong online streaming performance and preserves general video understanding, reaching 82.67% on StreamingBench, 67.03% on OVO-Bench, and 74.4% average accuracy on offline video benchmarks, while outperforming strong recent-window baselines and prior streaming memory methods.
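
The abstract names three decisions (when to write, what to preserve, how to retrieve) without giving implementation details. The toy sketch below is one way to picture a fixed-capacity memory of that kind; it is not the authors' method. Everything in it is an assumption made for illustration: the class name `ToyLatentMemory`, the use of cosine distance to the previous feature as "surprise", the threshold values, the priority-weighted merge used for consolidation, and plain top-k cosine retrieval standing in for the paper's query-conditioned graph reasoning. Features are ordinary Python lists standing in for latent vectors from a frozen VLM.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class ToyLatentMemory:
    """Hypothetical fixed-capacity memory; all rules here are illustrative."""

    def __init__(self, capacity=4, surprise_threshold=0.3):
        self.capacity = capacity
        self.surprise_threshold = surprise_threshold
        self.slots = []   # list of (vector, priority) pairs
        self.last = None  # most recent feature, used to measure surprise

    def observe(self, feat):
        """When to write: store a feature only if it differs enough from the last one."""
        if self.last is None:
            surprise = 1.0
        else:
            surprise = max(0.0, 1.0 - cosine(feat, self.last))
        self.last = list(feat)
        if surprise < self.surprise_threshold:
            return False
        self.slots.append((list(feat), surprise))
        if len(self.slots) > self.capacity:
            self._consolidate()
        return True

    def _consolidate(self):
        """What to preserve: fold the lowest-priority slot into its nearest neighbour."""
        i = min(range(len(self.slots)), key=lambda k: self.slots[k][1])
        vec_i, pri_i = self.slots.pop(i)
        j = max(range(len(self.slots)), key=lambda k: cosine(vec_i, self.slots[k][0]))
        vec_j, pri_j = self.slots[j]
        total = pri_i + pri_j
        if total == 0:
            merged = [(x + y) / 2 for x, y in zip(vec_i, vec_j)]
        else:
            merged = [(pri_i * x + pri_j * y) / total for x, y in zip(vec_i, vec_j)]
        # The merged slot keeps the higher of the two priorities.
        self.slots[j] = (merged, max(pri_i, pri_j))

    def retrieve(self, query, budget=2):
        """How to retrieve: return at most `budget` slots most similar to the query."""
        ranked = sorted(self.slots, key=lambda s: cosine(query, s[0]), reverse=True)
        return [vec for vec, _ in ranked[:budget]]


if __name__ == "__main__":
    mem = ToyLatentMemory(capacity=2)
    for feat in ([1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [-1.0, 0.0]):
        mem.observe(feat)
    print(len(mem.slots), mem.retrieve([-1.0, 0.0], budget=1))
```

In this sketch the number of stored slots never exceeds `capacity` and the number of retrieved vectors never exceeds `budget`, regardless of stream length, which mirrors the abstract's point that context does not grow with the stream. How SelectStream actually scores surprise, merges entries, or reasons over its memory graph is not specified in the abstract.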