CoVStream: Edge-Cloud Collaboration for Understanding of Long Video Streams
2026-06-22 • Computer Vision and Pattern Recognition
AI summary
The authors point out that long video streams are usually analyzed either on powerful cloud servers or on limited edge devices, and that each option has a drawback: cloud offloading incurs high bandwidth costs, while on-device processing is constrained by weak hardware. They created CoVStream, a system in which the edge device distills the video into compact visual features and captions, so less data is sent to the cloud. The cloud organizes this information and runs its heavy reasoning model only when a user query arrives. Their tests show that CoVStream reduces bandwidth usage by 87.6% while retaining 99.2% of the cloud baseline accuracy on LVBench.
long video streams, edge computing, cloud computing, video feature extraction, semantic captions, bandwidth optimization, entity graph, multimedia intelligence, video reasoning
Authors
Xu Liu, Guikun Chen, Zihao Yan, Kanzhi Wu, Wenguan Wang
Abstract
Long, continuous video streams are an increasingly critical driver of multimedia intelligence. Existing efforts often handle long videos with a sample-encode-reason approach using large models. However, they overlook a crucial deployment fact: the stream is often produced by computationally constrained devices. This forces an untenable compromise: cloud offloading unlocks strong reasoning but incurs prohibitive bandwidth overhead, while on-device processing remains limited by edge hardware capacity. Therefore, we propose CoVStream, the first edge-cloud collaborative framework for understanding long video streams. The edge node distills raw video streams into compact visual features and semantic captions for transmission to the cloud, minimizing bandwidth costs, while the cloud server integrates this data into an entity graph and global visual context, activating the heavy reasoning model only when a user query arrives. Experiments on VideoMME-Long, LVBench, and RTV-Bench show that CoVStream reduces bandwidth usage by 87.6% while retaining 99.2% of the cloud baseline accuracy on LVBench.
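The division of labor described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `EdgePacket`, `EdgeNode`, and `CloudServer`, the row-mean "feature", and the word-level "entity graph" are all hypothetical stand-ins chosen for illustration. The sketch only shows the control flow: the edge sends compact packets rather than raw frames, the cloud does cheap bookkeeping on arrival, and the expensive reasoning step runs only when a query comes in.

```python
from dataclasses import dataclass, field


@dataclass
class EdgePacket:
    """Compact payload sent instead of raw frames (hypothetical format)."""
    timestamp: float
    feature: list   # stand-in for a compact visual feature
    caption: str    # stand-in for a semantic caption


class EdgeNode:
    """Distills a raw frame into a small packet (placeholder encoder)."""

    def distill(self, timestamp, frame, caption):
        # Assumption: one mean value per pixel row stands in for a real
        # visual encoder; the caption is supplied by the caller.
        feature = [sum(row) / len(row) for row in frame]
        return EdgePacket(timestamp, feature, caption)


@dataclass
class CloudServer:
    """Accumulates packets; reasons only when a query arrives."""
    entity_graph: dict = field(default_factory=dict)    # entity -> timestamps
    visual_context: list = field(default_factory=list)  # (timestamp, feature)
    reasoning_calls: int = 0

    def ingest(self, packet):
        # Cheap bookkeeping only; no heavy model is invoked here.
        self.visual_context.append((packet.timestamp, packet.feature))
        # Assumption: every caption word is treated as an "entity".
        for word in packet.caption.lower().split():
            self.entity_graph.setdefault(word, []).append(packet.timestamp)

    def answer(self, query):
        # Stand-in for the heavy reasoning model: count the call and
        # return the timestamps at which queried entities were seen.
        self.reasoning_calls += 1
        hits = set()
        for word in query.lower().split():
            hits.update(self.entity_graph.get(word, []))
        return sorted(hits)


edge = EdgeNode()
cloud = CloudServer()
stream = [
    (0.0, [[0, 2], [4, 6]], "dog enters"),
    (1.0, [[1, 1], [1, 1]], "cat sleeps"),
    (2.0, [[3, 3], [5, 5]], "dog leaves"),
]
for timestamp, frame, caption in stream:
    cloud.ingest(edge.distill(timestamp, frame, caption))

calls_before_query = cloud.reasoning_calls
dog_times = cloud.answer("dog")
```

In this toy setup, ingesting the whole stream never triggers the reasoning stand-in; it is invoked once, by the `answer` call. A real system would replace the placeholder encoder, captioner, and graph construction with learned models.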