MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism
2026-06-05 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
AI summary
The authors present MemDreamer, a method that helps vision-language models understand hours-long videos without being overwhelmed. Instead of processing all video frames at once, the system streams the video and builds a layered memory graph that summarizes important parts and their relationships over time. When asked a question, the model explores this memory step by step rather than holding the entire video in context. In their tests, MemDreamer outperforms previous methods and comes within 3.7 points of human experts while using only about 2% of the full-context input. They also find that models with stronger logic reasoning skills tend to perform better on long-video understanding.
Vision-Language Models, Long-video understanding, Hierarchical Graph Memory, Agentic exploration, Token explosion, Attention dilution, Spatiotemporal relations, Observation-Reason-Action loop, Logic reasoning, Multimodal comprehension
Authors
Cong Chen, Guo Gan, Kaixiang Ji, ChaoYang Zhang, Zhen Yang, Guangming Yao, Hao Chen, Jingdong Chen, Yi Yuan, Chunhua Shen
Abstract
Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5 point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.
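The retrieval process described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the class and function names (GraphMemory, run_agent, scripted_policy), the tier labels, the "caused_by" relation, and the demo content are all assumptions made for illustration, and a hand-written policy stands in for the reasoning VLM. It only shows the general shape of a tiered memory exposing three tools (navigate the hierarchy, search nodes, traverse edges) driven by an Observation-Reason-Action loop.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    tier: int   # hypothetical tiers: 0 = video summary, 1 = segment, 2 = event
    text: str
    children: list = field(default_factory=list)   # ids of lower-tier nodes
    edges: dict = field(default_factory=dict)      # relation -> list of node ids


class GraphMemory:
    """Toy stand-in for a hierarchical graph memory (illustrative only)."""

    def __init__(self):
        self.nodes = {}

    def add(self, node_id, tier, text, parent=None):
        self.nodes[node_id] = Node(node_id, tier, text)
        if parent is not None:
            self.nodes[parent].children.append(node_id)

    def link(self, src, relation, dst):
        self.nodes[src].edges.setdefault(relation, []).append(dst)

    def _view(self, ids):
        return [(i, self.nodes[i].text) for i in ids]

    # The three tools below mirror the actions named in the abstract.
    def expand(self, node_id):              # navigate the hierarchy
        return self._view(self.nodes[node_id].children)

    def search(self, keyword):              # search nodes by keyword
        kw = keyword.lower()
        return self._view(i for i, n in self.nodes.items() if kw in n.text.lower())

    def traverse(self, node_id, relation):  # follow a logical edge
        return self._view(self.nodes[node_id].edges.get(relation, []))


def run_agent(memory, policy, question, max_steps=8):
    """Observation-Reason-Action loop; `policy` plays the reasoning model."""
    observation = memory.expand("root")     # initial observation
    trace = []
    for _ in range(max_steps):
        action, arg = policy(question, observation, trace)   # Reason
        if action == "answer":
            return arg, trace
        if action == "expand":                               # Action
            observation = memory.expand(arg)
        elif action == "search":
            observation = memory.search(arg)
        elif action == "traverse":
            observation = memory.traverse(*arg)
        else:
            raise ValueError(f"unknown action: {action}")
        trace.append((action, arg, observation))             # Observation
    return None, trace


def scripted_policy(question, observation, trace):
    """Hand-written policy for the demo; a real system would query a VLM."""
    if not trace:
        return "search", "glass"
    last_action = trace[-1][0]
    if last_action == "search":
        return "traverse", (observation[0][0], "caused_by")
    return "answer", observation[0][1]


def build_demo_memory():
    m = GraphMemory()
    m.add("root", 0, "Home kitchen video")
    m.add("s1", 1, "Kitchen preparation", parent="root")
    m.add("s2", 1, "Accident and cleanup", parent="root")
    m.add("e1", 2, "Cat jumps onto the counter", parent="s1")
    m.add("e2", 2, "Glass falls and breaks", parent="s2")
    m.link("e2", "caused_by", "e1")         # causal edge in the base graph
    return m


if __name__ == "__main__":
    answer, steps = run_agent(build_demo_memory(), scripted_policy,
                              "Why did the glass break?")
    print(answer, len(steps))
```

In this sketch the agent only ever sees the short tool results, never the full memory, which is the intuition behind keeping the reasoning context to a small fraction of full-context ingestion.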