PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

2026-06-15
Computer Vision and Pattern Recognition

AI summary

The authors present PermaVid, a system for keeping generated video consistent after edits change a scene's appearance or layout. The method stores what things look like (appearance) separately from their shape and position (geometry), and updates that memory in an edit-aware way so stored context does not go stale. It uses two memory banks: an RGB memory for appearance and a depth memory for structure, both of which serve as references when generating later frames. The authors report that PermaVid maintains semantic and structural consistency after edits better than prior state-of-the-art methods.

video generation, memory networks, semantic appearance, geometric structure, multi-modal context, depth memory, RGB memory, long-term consistency, video editing, feature fusion
Authors
Shuai Yang, Bingjie Gao, Ziwei Liu, Jiaqi Wang, Dahua Lin, Tong Wu
Abstract
Consistent video generation under editing operations requires persistence: when edits modify scene appearance or layout, subsequent generations should remain coherent across time and viewpoints. However, existing memory designs struggle to maintain long-term consistency after such modifications, as stored contexts may become outdated or invalid. To address this, we propose PermaVid, a novel framework built upon a multi-modal context memory that disentangles spatial context into semantic appearance and geometric structure, together with an edit-aware memory update and retrieval strategy that keeps memory evolution aligned with subsequent observations. Specifically, we develop two complementary memory banks: an RGB context memory that captures appearance-aware observations while implicitly encoding geometry, and a depth context memory that preserves geometry-only structure disentangled from semantics. Building on this design, we introduce a memory-guided video generation model that performs multi-modal feature fusion under reference conditions drawn from mixed-modality memory contexts. Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming state-of-the-art methods.
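The abstract does not specify how the two memory banks are stored, updated, or queried, so the following is only a toy sketch of the general idea, not the authors' implementation. It assumes that each memory entry is keyed by a camera position, that an edit can be described by a center and radius, and that an appearance-only edit invalidates nearby RGB entries while leaving depth entries usable. The names `Entry`, `DualContextMemory`, `apply_edit`, and `retrieve` are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Entry:
    frame_id: int
    pose: tuple        # hypothetical camera position (x, y, z)
    feature: list      # placeholder for an encoded frame
    valid: bool = True


@dataclass
class DualContextMemory:
    """Toy two-bank memory: one bank for RGB context, one for depth context."""
    rgb: list = field(default_factory=list)
    depth: list = field(default_factory=list)

    def write(self, frame_id, pose, rgb_feat, depth_feat):
        # Store one observation in both banks under the same pose key.
        self.rgb.append(Entry(frame_id, pose, rgb_feat))
        self.depth.append(Entry(frame_id, pose, depth_feat))

    def apply_edit(self, center, radius, kind):
        # Assumption: an "appearance" edit makes nearby RGB context stale but
        # leaves geometry intact; any other kind (e.g. "layout") affects both.
        banks = [self.rgb] if kind == "appearance" else [self.rgb, self.depth]
        for bank in banks:
            for entry in bank:
                if math.dist(entry.pose, center) <= radius:
                    entry.valid = False

    def retrieve(self, pose, k=2):
        # Return up to k still-valid entries per bank, nearest pose first.
        def nearest(bank):
            live = [e for e in bank if e.valid]
            return sorted(live, key=lambda e: math.dist(e.pose, pose))[:k]

        return {"rgb": nearest(self.rgb), "depth": nearest(self.depth)}


memory = DualContextMemory()
memory.write(0, (0.0, 0.0, 0.0), [1.0], [0.5])
memory.write(1, (5.0, 0.0, 0.0), [2.0], [0.7])
memory.apply_edit(center=(0.0, 0.0, 0.0), radius=1.0, kind="appearance")
context = memory.retrieve((0.0, 0.0, 0.0))
```

After the appearance edit in this example, the RGB entry for frame 0 is skipped at retrieval while its depth entry is still returned, which gives a mixed-modality reference set of the kind the abstract describes. A real system would presumably operate on learned features and visibility rather than raw pose distance.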