Retrieve What's Missing: Coverage-Maximizing Retrieval for Consistent Long Video Generation

2026-06-01

Computer Vision and Pattern Recognition
AI summary

The authors address the problem of keeping generated video geometrically consistent over long horizons when frames are generated autoregressively. They propose COVRAG, a method that uses depth information and pretrained 3D priors to decide which past frames to retrieve as memory when generating new frames. Retrieval favors frames that cover regions of the target view not yet explained by the current context or by previously selected memories, and sliding-window depth caching keeps geometry estimation efficient. In experiments on RealEstate10K and DL3DV10K, the method improves long-horizon geometric consistency over baselines while maintaining low latency.

autoregressive video generation, geometric consistency, memory-augmented generative models, 3D priors, depth estimation, frame retrieval, coverage map, sliding-window caching, RealEstate10K, DL3DV10K
Authors
Minseok Joo, Dogyun Park, Taehoon Lee, Kyujin Lee, Hyunwoo J. Kim
Abstract
Maintaining long-term geometric consistency remains challenging for long-horizon autoregressive video generation. Memory-augmented generative models address this by retrieving historical frames, but their effectiveness depends on two key design choices: what 3D-geometric evidence should represent past observations, and how memory frames should be selected from this evidence. Existing methods often rely on camera poses or field-of-view overlap, which are lightweight but too coarse to reason about pixel-wise visibility, or use explicit 3D reconstruction, which provides fine-grained evidence but is costly to maintain over long rollouts. We propose Coverage-Maximizing Retrieval-Augmented Generation (COVRAG), a depth-based memory retrieval framework that uses pretrained 3D priors to construct a target-view coverage map as lightweight 3D memory evidence. For frame selection, COVRAG maximizes residual coverage gain, iteratively retrieving frames that explain target-view regions not covered by the current context or previously selected memories. To improve scalability in long-video generation, we introduce sliding-window depth caching for efficient geometry estimation. Experiments on RealEstate10K and DL3DV10K show that COVRAG improves long-horizon geometric consistency while maintaining low latency compared to baselines.
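
The frame-selection idea in the abstract, iteratively retrieving the frame that explains the most target-view pixels not yet covered, can be illustrated with a simple greedy loop. The sketch below is not the authors' implementation. It assumes each candidate frame has already been reduced to a boolean target-view coverage mask (in the paper these would come from depth and pretrained 3D priors, which are not modeled here). The function name `greedy_coverage_retrieval`, its parameters, and the tie-breaking rule are hypothetical.

```python
import numpy as np


def greedy_coverage_retrieval(candidate_masks, context_mask, k):
    """Greedily pick up to k frames by residual coverage gain (illustrative only).

    candidate_masks: bool array (N, H, W); candidate_masks[i] marks the
        target-view pixels that candidate frame i is ASSUMED to cover.
    context_mask: bool array (H, W); pixels already covered by the current context.
    k: maximum number of memory frames to retrieve.
    Returns (selected_indices, final_covered_mask).
    """
    covered = np.asarray(context_mask, dtype=bool).copy()
    masks = np.asarray(candidate_masks, dtype=bool)
    remaining = set(range(len(masks)))
    selected = []

    for _ in range(k):
        if not remaining:
            break
        # Residual gain: pixels a candidate covers that nothing covers yet.
        gains = {i: int(np.count_nonzero(masks[i] & ~covered)) for i in remaining}
        # Assumed tie-break: prefer the lower index when gains are equal.
        best = max(gains, key=lambda i: (gains[i], -i))
        if gains[best] == 0:
            break  # no remaining candidate adds new coverage
        selected.append(best)
        covered |= masks[best]
        remaining.remove(best)

    return selected, covered


# Toy usage on a 1x6 "target view".
context = np.array([[1, 1, 0, 0, 0, 0]], dtype=bool)
candidates = np.array([
    [[1, 1, 1, 0, 0, 0]],  # mostly overlaps the context
    [[0, 0, 1, 1, 1, 0]],  # covers three uncovered pixels
    [[0, 0, 0, 0, 1, 1]],  # covers the last two pixels
], dtype=bool)
picked, final_cover = greedy_coverage_retrieval(candidates, context, k=2)
```

In this toy case the first candidate is skipped even though it covers as many pixels as the second, because almost all of its coverage is already provided by the context. Gain is measured against what is still missing rather than against raw overlap.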