Vera: A Layered Diffusion Model for Content-Preserving Video Editing

2026-06-22
Computer Vision and Pattern Recognition

AI summary

The authors present Vera, a video editing method that leaves untouched the parts of the source video that should not change. Instead of regenerating every pixel, Vera generates a separate edit layer and an alpha matte, which are composited with the source video so that elements such as characters and backgrounds are preserved. To keep the edit layer coherent with the source, the text-to-video DiT is extended into a Mixture-of-Transformers architecture in which separate DiTs for each layer interact through joint self-attention. The authors also built a layered dataset with accurate alpha mattes to train the model. In a quantitative benchmark and a human preference study, Vera preserves source content better than leading open-source video editing models while remaining competitive in edit quality.
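
The layered approach rests on compositing an edit layer over the source video with an alpha matte. The sketch below illustrates standard alpha blending on a toy video; it assumes straight (non-premultiplied) alpha and float pixel values in [0, 1], neither of which is stated in the source, and the function name `composite` is hypothetical.

```python
import numpy as np

def composite(source, edit, alpha):
    """Blend an edit layer over a source video with a per-pixel alpha matte.

    source, edit: float arrays of shape (T, H, W, 3), values in [0, 1]
    alpha:        float array of shape (T, H, W, 1), values in [0, 1]
    Assumption: straight (non-premultiplied) alpha.
    """
    alpha = np.clip(alpha, 0.0, 1.0)
    # Where alpha is 0 the source pixel passes through unchanged.
    return alpha * edit + (1.0 - alpha) * source

# Toy video: 2 frames of 2x2 pixels, gray source and white edit layer.
source = np.full((2, 2, 2, 3), 0.5)
edit = np.ones((2, 2, 2, 3))
alpha = np.zeros((2, 2, 2, 1))
alpha[:, 0, 0, 0] = 1.0  # edit only the top-left pixel of every frame

out = composite(source, edit, alpha)
```

Because every pixel with a zero matte value is copied from the source, preservation outside the edited region holds by construction rather than depending on how faithfully a model regenerates it.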

video diffusion models, content preservation, edit layer, alpha matte, text-to-video, Mixture-of-Transformers, self-attention, layered dataset, video editing, DiT (Diffusion Transformer)
Authors
Hongkai Zheng, Ta-Ying Cheng, Benjamin Klein, Yisong Yue, Zhuoning Yuan
Abstract
Video diffusion models have enabled remarkable progress in video generation and editing. However, content preservation remains a core challenge: existing methods regenerate every pixel and often alter elements that should remain unchanged, such as characters or background scenes. We introduce Vera, a layered diffusion framework for content-preserving video editing. Instead of regenerating the entire video, Vera generates an edit layer along with an alpha matte for compositing with the source video, separating creative editing from content preservation by design. To encourage coherent composition with the source video, we extend the text-to-video DiT into a Mixture-of-Transformers (MoT) architecture, with separate DiTs for each layer that interact through joint self-attention. To support the training of Vera, we further construct a high-quality layered dataset with accurate alpha mattes, diverse scenes and dynamics, and visual effects. Across our quantitative benchmark and human preference study, Vera outperforms leading open-source video editing models in content preservation while remaining competitive in edit quality, using 486K frames of layered training data.
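
The abstract does not give the layer-level details of the Mixture-of-Transformers design. As a rough illustration of the general idea of separate per-layer weights combined with joint self-attention, the sketch below projects source tokens and edit-layer tokens with their own weights, attends over the concatenated sequence, and splits the result back per stream. Everything here is an assumption made for illustration: a single head, no output projection or positional encoding, random weights, and hypothetical names such as `joint_self_attention`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_self_attention(src_tokens, edit_tokens, src_w, edit_w):
    """Single-head joint attention over two token streams.

    src_tokens: (N_s, D), edit_tokens: (N_e, D)
    src_w, edit_w: dicts with "q", "k", "v" matrices of shape (D, D),
    one set per stream (the per-stream weights are the "separate" part).
    """
    # Each stream projects its own tokens with its own weights.
    q = np.concatenate([src_tokens @ src_w["q"], edit_tokens @ edit_w["q"]])
    k = np.concatenate([src_tokens @ src_w["k"], edit_tokens @ edit_w["k"]])
    v = np.concatenate([src_tokens @ src_w["v"], edit_tokens @ edit_w["v"]])
    # Attention runs over the concatenated sequence, so every token
    # can attend to tokens from both streams.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    out = attn @ v
    n_s = src_tokens.shape[0]
    return out[:n_s], out[n_s:]

rng = np.random.default_rng(0)
D = 8
src_tokens = rng.normal(size=(4, D))
edit_tokens = rng.normal(size=(6, D))
src_w = {name: rng.normal(size=(D, D)) for name in ("q", "k", "v")}
edit_w = {name: rng.normal(size=(D, D)) for name in ("q", "k", "v")}

src_out, edit_out = joint_self_attention(src_tokens, edit_tokens, src_w, edit_w)
```

In this toy setup the edit-stream output depends on the source tokens through the shared attention matrix, which is the kind of interaction that could encourage an edit layer to stay coherent with the source video.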