Full-4D: Generating Full-Scope 4D Scenes from a Single-View Video

2026-05-25
Computer Vision and Pattern Recognition

AI summary

The authors address the problem of generating complete, time-varying 3D scenes (4D scenes) from a single camera view, which is ill-posed because one viewpoint does not observe the whole scene. Their approach first synthesizes the videos that multiple other cameras would have captured, then reconstructs a 4D model from those generated views. To do this, they collected a large real-world dataset of synchronized multi-view videos, trained a multi-view video diffusion model whose attention is tied to 3D geometry and camera poses, and lifted the generated videos into a consistent 4D representation. Their experiments report that the method outperforms earlier approaches in visual fidelity and geometric consistency when generating 4D scenes from single-view videos.

4D scene generation, single-view video, multi-view video synthesis, video diffusion model, time-view attention, camera conditioning, geometric reprojection, Flow Matching Distillation, 4D Gaussian Splatting (4DGS), novel-view rendering
Authors
Tingxi Chen, Ke Hao, Yabo Chen, Zhengxue Cheng, Rong Xie, Li Song, Haibin Huang, Chi Zhang, Xuelong Li
Abstract
Generating 4D scenes from a single-view video is inherently ill-posed: a single viewpoint lacks the information needed to recover a complete, dynamic scene with full coverage. Existing methods are typically limited to monocular videos, simple 3D effects, or only small viewpoint perturbations around the original viewpoint, falling short of true 4D generation. Meanwhile, the lack of large-scale datasets capturing full-scope 4D scenes with synchronized multi-view videos further hinders progress in this direction. We propose a novel single-view video-to-4D framework that casts full-scope 4D generation as multi-view video synthesis followed by optimization-based 4D reconstruction from the generated views. To instantiate this formulation end-to-end, we make three key contributions. First, we introduce Real-MV-4D, a large-scale dataset of synchronized multi-view videos captured in diverse real-world environments to provide 4D supervision. Second, we train a multi-view video diffusion model driven by a novel fused time (T)-view (V) attention mechanism that directly embeds geometric reprojection priors and explicit camera conditioning into its view-time interactions. Unlike basic feature fusion, this direct binding strictly aligns the generation process with physical 3D priors to produce a dense, synchronized $T \times V$ video grid. Third, rather than relying on non-interactive and inconsistent 2D video interpolations, we lift the synthesized multi-view videos into an explicit 4D representation (i.e., 4DGS), regularized by a Flow Matching Distillation loss that exploits the multi-view prior to improve novel-view rendering. Extensive experiments demonstrate that our method outperforms existing approaches in both visual fidelity and geometric consistency, enabling full-scope 4D scene generation from single-view videos.
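
The geometric reprojection mentioned in the abstract can be illustrated with a standard pinhole-camera warp between two views. This is only a minimal sketch and not the authors' implementation: the abstract does not say how reprojection priors enter the attention mechanism. The sketch assumes a known per-pixel depth, intrinsics shared by both views, and a known relative pose; the function name `reproject_pixel` and all parameter names are hypothetical.

```python
def reproject_pixel(u, v, depth, intrinsics, R, t):
    """Map pixel (u, v) with known depth in a source view to a target view.

    Assumptions (illustrative, not taken from the paper):
      intrinsics: (fx, fy, cx, cy), shared by both views.
      R: 3x3 rotation as a list of rows; t: 3-vector. Together they take
         source-camera coordinates to target-camera coordinates,
         i.e. X_tgt = R @ X_src + t.
    Returns (u', v', depth') in the target view, or None if the point
    lies behind the target camera.
    """
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel to a 3D point in the source camera frame.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    p_src = (x, y, depth)
    # Rigidly transform the point into the target camera frame.
    p_tgt = [sum(R[i][j] * p_src[j] for j in range(3)) + t[i] for i in range(3)]
    if p_tgt[2] <= 0:
        return None
    # Project with the pinhole model.
    u2 = fx * p_tgt[0] / p_tgt[2] + cx
    v2 = fy * p_tgt[1] / p_tgt[2] + cy
    return (u2, v2, p_tgt[2])


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = (500.0, 500.0, 320.0, 240.0)
# Target camera shifted 0.125 units to the right, so the point moves
# left in the target image.
print(reproject_pixel(320.0, 240.0, 2.0, K, identity, [-0.125, 0.0, 0.0]))
# prints (288.75, 240.0, 2.0)
```

Applying such a warp to every pixel gives a correspondence between locations in different views at the same timestamp, which is the kind of geometric relation a cross-view attention layer could be biased toward.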