Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models
2026-03-26 • Computer Vision and Pattern Recognition • Artificial Intelligence
AI summary
The authors identify a failure mode in video world models: dynamic subjects that move out of view and later reappear are often frozen, distorted, or lost. They propose Hybrid Memory, a paradigm in which the model memorizes static backgrounds while continuing to track dynamic subjects through out-of-view intervals. To support this, they construct HM-World, a large-scale video dataset with complex scenes and decoupled camera and subject motion. They also develop HyDRA, a memory architecture that retrieves relevant motion cues to keep hidden subjects consistent. Experiments show that their method outperforms existing approaches at preserving the identity and motion of dynamic subjects.
video world models, memory mechanisms, dynamic subjects, hybrid memory, spatiotemporal relevance, motion continuity, video dataset, object tracking, HyDRA, generative modeling
Authors
Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu, Pengfei Wan, Xiang Bai
Abstract
Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.
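The abstract describes HyDRA as compressing memory into tokens and using spatiotemporal relevance-driven retrieval to attend selectively to motion cues. The sketch below illustrates the general idea of such a retrieval step — scoring a compressed memory bank against the current query and keeping only the most relevant tokens. It is a minimal, hypothetical illustration, not the paper's architecture: the function name `retrieve_memory`, the cosine-similarity scoring, and the top-k selection are all assumptions for exposition.

```python
import numpy as np

def retrieve_memory(query, memory_tokens, top_k=4):
    """Illustrative relevance-driven retrieval (not the paper's exact method).

    query:         (d,) feature vector for the current generation step
    memory_tokens: (n, d) compressed memory bank of past-frame tokens
    Returns the top_k most relevant memory tokens and their indices.
    """
    # Cosine similarity as a stand-in relevance score; a learned
    # spatiotemporal relevance function would replace this in practice.
    q = query / (np.linalg.norm(query) + 1e-8)
    m = memory_tokens / (np.linalg.norm(memory_tokens, axis=1, keepdims=True) + 1e-8)
    scores = m @ q
    # Keep only the top_k highest-scoring tokens for the attention step.
    idx = np.argsort(scores)[::-1][:top_k]
    return memory_tokens[idx], idx
```

Selective retrieval of this kind keeps the attention cost bounded by `top_k` rather than the full memory length, which is why token compression plus relevance scoring is a natural fit for long out-of-view intervals.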