Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

2026-03-19 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Robotics

AI summary

The authors argue that video generation models, which must keep frames consistent over time, implicitly learn 3D structure and physical rules. They designed VEGA-3D, a plug-and-play framework that uses a pre-trained video diffusion model to give multimodal large language models a better understanding of space and geometry without explicit 3D supervision. By fusing features extracted from the video model with semantic representations, their method helps machines perform better on tasks involving 3D scenes and physical reasoning. Experiments on 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks showed their approach outperforming state-of-the-art baselines.

Multimodal Large Language Models, video diffusion models, 3D scene understanding, spatial reasoning, physical dynamics, generative models, latent space, geometric cues, token fusion, embodied manipulation

Authors

Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan, Xiang Bai

Abstract

While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
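
The abstract names a token-level adaptive gated fusion mechanism but does not give its formula. As a rough illustration only, the sketch below assumes one common form: a scalar sigmoid gate per token, computed from the concatenated semantic and generative features, that blends the two streams as a convex combination. The function `gated_fusion`, the `weights` vector, and the toy dimensions are hypothetical and are not taken from the VEGA-3D code.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fusion(semantic, generative, weights, bias=0.0):
    """Blend two aligned token sequences with one scalar gate per token.

    semantic, generative: lists of tokens, each token a list of d floats.
    weights: 2*d floats scoring the concatenated token pair (hypothetical
    stand-in for a learned projection; the paper's gate may differ).
    Returns (fused_tokens, gates).
    """
    fused, gates = [], []
    for s_tok, g_tok in zip(semantic, generative):
        # Score the concatenated pair, then squash the score to (0, 1).
        score = sum(w * x for w, x in zip(weights, s_tok + g_tok)) + bias
        gate = sigmoid(score)
        # Convex combination: gate near 1 keeps semantics, near 0 keeps
        # the generative (geometry-bearing) feature.
        fused.append([gate * s + (1.0 - gate) * g for s, g in zip(s_tok, g_tok)])
        gates.append(gate)
    return fused, gates


# Toy data standing in for real encoder outputs (assumed shapes).
random.seed(0)
num_tokens, dim = 4, 3
semantic = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
generative = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
weights = [random.gauss(0, 1) for _ in range(2 * dim)]

fused, gates = gated_fusion(semantic, generative, weights)
print([round(g, 3) for g in gates])
```

Because the gate is computed per token, each spatial location can lean on whichever stream is more informative there. In a trained system the weights would be learned jointly with the rest of the model rather than sampled at random as in this toy setup.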
