Memento: Reconstruct to Remember for Consistent Long Video Generation
2026-06-12 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition
AI summaryⓘ
The authors propose Memento, a new way to make long videos where the same characters or objects look consistent throughout many scenes. Unlike previous methods that mostly guess what should happen next, Memento keeps a memory of important details about subjects to help reconstruct their appearance accurately. It uses two types of memory queries: one to remember who the subject is and another to keep the story flowing smoothly. Their method also uses detailed descriptions to teach the system how to keep subjects looking the same. Tests show Memento does better at keeping characters consistent and making videos look good over a long time.
long-form video generationtemporal decompositionsubject consistencyautoregressive generationmemory banksubject reconstructiondual-query memoryidentity groundingstory captionscinematic data pipeline
Authors
Xuan Wei, Longbin Ji, Guan Wang, Xiangrui Liu, Zhenyu Zhang, Shuohuan Wang, Yu Sun, Qingqi Hong
Abstract
Long-form video generation requires recurring subjects to remain consistent across various shots, viewpoints, motions, and scene transitions. Existing temporal decomposition methods improve scalability by generating videos shot by shot. However, they mainly focus on optimizing plausible next-shot continuations without verifying whether the historical memory preserves identity-critical subject evidence. Consequently, as generation proceeds, recurring subjects may be diluted, overwritten, or forgotten. In this paper, we propose Memento, a subject-reconstruction-guided framework that treats subject preservation as an explicit identity grounding problem, based on the premise that a memory bank faithfully preserving a subject should support reconstructing that subject from memory alone. Specifically, Memento jointly trains autoregressive next-shot generation with memory-based subject reconstruction, recovering target appearances using historical memory and global story captions. To disentangle long-range subject evidence from short-range cues, Memento introduces a dual-query memory mechanism, where one query retrieves identity-relevant memory and the other selects short-context keyframes for coherent continuation. Additionally, a subject-aware cinematic data pipeline provides precise reconstruction supervision via consistent, pronoun-free subject descriptions. Experiments demonstrate that Memento achieves state-of-the-art performance in long-term subject consistency, cross-shot coherence, and visual quality.