StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering

2026-05-25 • Computer Vision and Pattern Recognition

AI summary

The authors noticed that current methods for understanding videos with sound and images work well only when analyzing pre-recorded clips, not live streams. They created StreamOV, a system that can remember important past audio-visual information efficiently and decide the best moments to respond during a live video. To test this, they also built SOVBench, a new way to evaluate how well systems handle ongoing, multi-turn video interactions. Their experiments show StreamOV performs better than previous methods in both live and offline video understanding tasks.

omni-video understanding, streaming video, audio-visual reasoning, long-short term memory, proactive response triggering, multi-turn interaction, benchmark, online video processing, evidence-guided memory

Authors

Ming Xie, Zizheng Huang, Xudong Tan, Chao Wang, Xiangyu Zeng, Wenxiao Wu, Tao Chen, Limin Wang, Yanwei Fu

Abstract

While streaming omni-video understanding demands continuous perception and proactive, real-time interaction, this crucial area remains largely under-explored. Current omni-modal methods are inherently designed for offline settings, limiting their applicability in streaming scenarios, and two gaps stand out. First, these methods lack robust mechanisms to manage continuously growing audio-visual context over long horizons and cannot autonomously initiate responses at opportune moments. Second, existing benchmarks are predominantly confined to offline, single-turn question answering, failing to capture continuous, multi-turn streaming interactions. To bridge these gaps, we propose StreamOV, a novel Streaming Omni-Video understanding framework for efficient online audio-visual reasoning with bounded memory and proactive response triggering. Specifically, StreamOV introduces a multimodal evidence-guided long-short term memory that condenses historical audio-visual context into compact informative evidence under a fixed budget. It further employs a hidden-state-driven trigger to decide when to respond, avoiding explicit silence-token generation and external routers. We also curate SOVBench, the first comprehensive benchmark for online, multi-turn omni-modal evaluation. Extensive experiments show that StreamOV achieves state-of-the-art performance across diverse streaming and omni-video benchmarks, demonstrating its effectiveness for both online and offline video understanding.
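
The two ideas named in the abstract, a memory held to a fixed budget and a trigger that decides when to respond, can be illustrated with a toy sketch. This is not StreamOV's implementation: the abstract does not say how evidence is scored or how the trigger reads hidden states, so the class `BoundedEvidenceMemory`, the function `should_respond`, the per-chunk relevance `score`, and the `threshold` are all hypothetical. The sketch only shows one plausible way to keep a short window of recent chunks plus a capped set of high-scoring older chunks, so that the context never grows beyond a fixed size.

```python
from collections import deque
import heapq


class BoundedEvidenceMemory:
    """Toy long-short term memory with a fixed budget (illustrative only).

    Assumption: each incoming chunk arrives with a scalar relevance score.
    How such a score would be computed is not specified in the abstract.
    """

    def __init__(self, short_size, long_size):
        self.short = deque()   # most recent chunks, oldest first
        self.long = []         # min-heap of (score, arrival_index, chunk)
        self.short_size = short_size
        self.long_size = long_size
        self._count = 0        # arrival index; also breaks ties in the heap

    def add(self, chunk, score):
        self.short.append((score, self._count, chunk))
        self._count += 1
        if len(self.short) > self.short_size:
            evicted = self.short.popleft()
            if len(self.long) < self.long_size:
                heapq.heappush(self.long, evicted)
            else:
                # Push the evicted chunk, then drop the lowest-scoring entry,
                # so long-term memory keeps only the top-scoring chunks.
                heapq.heappushpop(self.long, evicted)

    def context(self):
        """Return retained chunks in arrival order: long-term, then recent."""
        items = sorted(self.long, key=lambda e: e[1]) + list(self.short)
        return [chunk for _, _, chunk in items]


def should_respond(trigger_score, threshold=0.5):
    """Hypothetical trigger: respond when a scalar score passes a threshold.

    Assumption: the score would be derived from the model's hidden state;
    that mapping is outside this sketch.
    """
    return trigger_score >= threshold


memory = BoundedEvidenceMemory(short_size=2, long_size=2)
for chunk, score in zip("abcdef", [0.9, 0.1, 0.5, 0.8, 0.2, 0.3]):
    memory.add(chunk, score)
print(memory.context())
```

With a budget of two short-term and two long-term slots, the context never exceeds four chunks no matter how long the stream runs, which is the bounded-memory property the abstract describes.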
