Teaching Video Generators to Remember: Eliciting Dynamic Memory for Out-of-Sight State Evolution

2026-05-25 • Computer Vision and Pattern Recognition

AI summary

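The authors observe that current video generators tend to freeze their hidden states when observation is interrupted, instead of continuing to evolve the parts of a scene that are out of sight. They propose ReMind, a training framework that teaches a model to use its existing KV-cache as dynamic memory through memory-oriented data, event-aware training, and cache adaptation. The training data is organized around a taxonomy of 100+ dynamic events, and a node-structured curriculum (node-drop, noisy memory, frontier continuation, and reference-cache training) forces the model to retrieve relevant past states across interruptions rather than rely only on local continuity. ReMind achieves the best overall scores on STEVO-Bench and recovery tasks, and general image-to-video evaluations indicate that the curriculum avoids catastrophic forgetting. The authors also introduce PM-RoPE, a camera-phase extension of RoPE that enables spatiotemporal retrieval at a single-attention cost while preserving pretrained pathways.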

video world models, hidden states, video diffusion transformers, KV-cache, dynamic memory, spatiotemporal retrieval, STEVO-Bench, curriculum training, PM-RoPE, catastrophic forgetting

Authors

Tianshuo Xu, Yichen Xie, Depu Meng, Chensheng Peng, Quentin Herau, Bo Jiang, Yihan Hu, Wei Zhan

Abstract

Video world models should maintain evolving states when evidence is unobserved, yet current generators often freeze hidden states upon interruption. This is not simply a capacity problem: pretrained video diffusion transformers already possess KV-cache mechanisms capable of non-local retrieval, but they are rarely trained to use them as dynamic memory. We introduce ReMind, a framework eliciting dynamic memory behavior via memory-oriented data, event-aware training, and cache adaptation. Organized around a taxonomy of 100+ dynamic events, we build a camera-annotated training mixture combining VLM-filtered real videos, generated hard dynamics, synthetic camera loops, and memory-interruption augmentations. Each clip is converted into a frame graph with protected anchors, degraded intervals, and explicit temporal gaps. A node-structured curriculum, including node-drop, noisy memory, frontier continuation, and reference-cache training, forces the model to retrieve relevant past states across interruptions rather than relying solely on local continuity. PM-RoPE, an elegant camera-phase RoPE extension, unlocks spatiotemporal retrieval at a single-attention cost while preserving pretrained pathways. ReMind achieves the best overall scores on STEVO-Bench and recovery tasks. Furthermore, general image-to-video evaluations confirm this curriculum avoids catastrophic forgetting. We will open-source our code, data, and models.
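
The frame-graph and node-drop ideas in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the graph format or the drop rule, so `FrameNode`, `node_drop`, `temporal_gaps`, and the `max_gap` parameter are hypothetical names chosen for illustration. The sketch assumes a clip is a linear chain of frame nodes, that protected anchors are never dropped, and that surviving nodes keep their original time indices so the resulting temporal gap stays explicit.

```python
import random
from dataclasses import dataclass


@dataclass
class FrameNode:
    time_index: int   # original position in the clip, kept after drops
    is_anchor: bool   # assumption: protected anchors are never dropped


def node_drop(nodes, max_gap, rng):
    """Drop one contiguous run of non-anchor nodes (hypothetical rule)."""
    # Candidate start positions: any non-anchor node.
    starts = [i for i, n in enumerate(nodes) if not n.is_anchor]
    if not starts or max_gap < 1:
        return list(nodes)
    start = rng.choice(starts)
    length = rng.randint(1, max_gap)
    end = start
    # Extend the run until it reaches the sampled length or hits an anchor.
    while end < len(nodes) and end - start < length and not nodes[end].is_anchor:
        end += 1
    return nodes[:start] + nodes[end:]


def temporal_gaps(nodes):
    """Time-index difference between consecutive surviving nodes."""
    return [b.time_index - a.time_index for a, b in zip(nodes, nodes[1:])]


# Usage: a 12-frame clip whose first and last frames are protected anchors.
clip = [FrameNode(t, is_anchor=t in (0, 11)) for t in range(12)]
kept = node_drop(clip, max_gap=4, rng=random.Random(0))
gaps = temporal_gaps(kept)
```

Any entry of `gaps` larger than 1 marks an interruption that a model trained on such a sample would have to bridge by retrieving earlier states instead of relying on adjacent frames.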
