X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

2026-06-01 | Computer Vision and Pattern Recognition

AI summary

The authors created X-Stream, the first test designed to check how well AI models understand video content from multiple streams at the same time, like watching several camera angles or devices together. They collected a large set of questions from many videos where understanding needs to come from more than one stream, and made sure models couldn't just rely on one view. They also tested current multi-modal language models and found these models struggle a lot with juggling multiple streams simultaneously. Their work highlights important challenges and offers a way to better evaluate future AI systems that need to handle multiple video inputs at once.

multi-stream video understanding, multi-modal large language models, online inference, signal multiplexing theory, QA datasets, video benchmarks, multi-window scenarios, multi-view analysis, multi-device interaction
Authors
Peiwen Sun, Xudong Lu, Huadai Liu, Yang Bo, Dongming Wu, Huankang Guan, Minghong Cai, Jinpeng Chen, Xintong Guo, Shuhan Li, Rui Liu, Xiangyu Yue
Abstract
While streaming video understanding has made significant strides, real-world applications, such as live sports broadcasting, autonomous driving, and multi-screen collaboration, inherently demand continuous, multi-stream interactions. However, existing benchmarks are confined to single-stream paradigms, leaving a critical gap in evaluating online, cross-stream reasoning. To bridge this, we introduce X-Stream, the first benchmark dedicated to multi-stream streaming understanding. Comprising 4,220 rigorously curated QA pairs across 932 videos, X-Stream evaluates 11 subtasks across multi-window, multi-view, and multi-device scenarios. Crucially, our dataset is constructed using a novel dual-verification pipeline that prevents over-reliance on a single stream. Furthermore, we pioneer the conceptualization of multi-modal large language models (MLLMs) as naive multiplexers, systematically evaluating their performance through the lens of Signal Multiplexing Theory. Our extensive online inference experiments reveal a stark reality: state-of-the-art MLLMs struggle significantly with concurrent streams, scoring only about 50% and exhibiting poor proactive ability. Ultimately, X-Stream exposes the trade-offs of current multiplexing schemes, providing both a practical evaluation protocol and empirical guidance for next-generation multi-stream agents.