Future Dynamic 3D Reconstruction: A 3D World Model with Disentangled Ego-Motion
2026-06-16 • Computer Vision and Pattern Recognition
AI summary
The authors developed FR3D, a world model that predicts how a dynamic 3D scene will evolve from monocular (single-camera) observations. Unlike earlier generative world models that mix camera motion and scene motion in the 2D image plane, FR3D predicts a persistent 3D latent representation and models the scene's 3D evolution separately from the agent's own movement, which keeps geometry consistent and avoids artifacts such as morphing or vanishing objects. They also use a teacher-student distillation strategy that transfers spatial knowledge from off-the-shelf foundation models, which supports zero-shot generalization. Their experiments show strong performance on future dynamic 3D reconstruction across multiple datasets, even 2 seconds into the future.
3D reconstruction, dynamic environments, ego-motion, latent representation, world models, future prediction, monocular observations, teacher-student distillation, geometric consistency, zero-shot generalization
Authors
Nils Morbitzer, Jonathan Evers, Artem Savkin, Thomas Stauner, Nassir Navab, Federico Tombari, Stefano Gasperini
Abstract
Forecasting the evolution of dynamic environments is crucial for autonomous agents. While generative world models have recently achieved high photorealism in 2D video synthesis by mixing ego-motion and environmental dynamics within the image plane, they exhibit physical inconsistencies, such as morphing or vanishing objects, especially over long time horizons. In this paper, we propose FR3D, a world model that predicts a persistent 3D latent representation for future dynamic 3D reconstruction. Unlike prior works that treat the world as a sequence of image-based features, FR3D explicitly decouples the 3D evolution of the scene from the agent's trajectory, treating the inferred ego-motion as a latent proxy for action. This disentanglement resolves the ambiguities between self-motion and world-motion, ensuring geometric consistency into the future. Furthermore, we introduce a teacher-student distillation strategy that leverages the spatial "common sense" of off-the-shelf foundation models, leading to robust zero-shot generalization. Extensive experiments demonstrate FR3D's strong performance for future dynamic 3D reconstruction from monocular observations across multiple datasets, even 2 seconds into the future. Project page: https://fr3d-wm.github.io.
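To make the ambiguity between self-motion and world-motion concrete, here is a minimal toy sketch in plain Python. It is not FR3D's method: the poses, points, and helper names (`make_pose`, `world_to_cam`, `cam_to_world`) are illustrative assumptions, and the ego-motion is given rather than inferred. It only shows that a static point appears to move when observed in the camera frame, and that the apparent motion separates cleanly into ego-motion and scene motion once the camera pose is known.

```python
import math

def make_pose(yaw, tx, ty, tz):
    """Hypothetical camera-to-world pose: rotation about z by `yaw`, then translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    rotation = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    return rotation, [tx, ty, tz]

def cam_to_world(pose, p):
    """Map a point from the camera frame to the world frame: R p + t."""
    rotation, t = pose
    return [sum(rotation[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]

def world_to_cam(pose, p):
    """Map a point from the world frame to the camera frame: R^T (p - t)."""
    rotation, t = pose
    d = [p[i] - t[i] for i in range(3)]
    return [sum(rotation[j][i] * d[j] for j in range(3)) for i in range(3)]

def sub(a, b):
    """Component-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]

# Assumed ego-motion: the agent moves 1 m forward and turns 0.1 rad.
pose_t0 = make_pose(0.0, 0.0, 0.0, 0.0)
pose_t1 = make_pose(0.1, 1.0, 0.0, 0.0)

# Assumed scene: one static point and one point that moves 0.5 m along x.
static_world = [5.0, 2.0, 0.0]
moving_world_t0 = [5.0, -2.0, 0.0]
moving_world_t1 = [5.5, -2.0, 0.0]

# What the camera observes at each time step (camera-frame coordinates).
static_cam_t0 = world_to_cam(pose_t0, static_world)
static_cam_t1 = world_to_cam(pose_t1, static_world)
moving_cam_t0 = world_to_cam(pose_t0, moving_world_t0)
moving_cam_t1 = world_to_cam(pose_t1, moving_world_t1)

# Entangled view: in the camera frame, even the static point appears to move.
apparent_static = sub(static_cam_t1, static_cam_t0)

# Disentangled view: lift each observation into the world frame with its own pose,
# so that only genuine scene motion remains.
scene_static = sub(cam_to_world(pose_t1, static_cam_t1),
                   cam_to_world(pose_t0, static_cam_t0))
scene_moving = sub(cam_to_world(pose_t1, moving_cam_t1),
                   cam_to_world(pose_t0, moving_cam_t0))

print("apparent motion of static point (camera frame):", apparent_static)
print("scene motion of static point (world frame):", scene_static)
print("scene motion of moving point (world frame):", scene_moving)
```

In this toy setup the camera pose is known exactly, whereas the abstract describes ego-motion as inferred and used as a latent proxy for action; the sketch only illustrates why keeping the two motion sources separate removes the ambiguity.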