Structured 4D Latent Predictive Model for Robot Planning

2026-07-01 • Robotics

Robotics, Computer Vision and Pattern Recognition

AI summary

The authors developed a new video prediction model that understands and predicts changes in a 3D scene over time, rather than just analyzing flat 2D videos. Their model creates a detailed and consistent 3D representation of the environment, which helps a robot plan and perform tasks more accurately. They showed that their approach improves the quality and consistency of future scene predictions and works well in real-world robot experiments. This leads to better performance on complex manipulation tasks and stronger generalization to new visual settings.

video prediction, 3D scene understanding, latent space, robotic planning, inverse dynamics, multi-view coherence, manipulation tasks, spatial reasoning, task generalization, structured latent model

Authors

Zhiyi Li, Peilin Wu, Xiaoshen Han, Ruojin Cai, Yilun Du

Abstract

Video predictive models are emerging as a powerful paradigm in robotics, offering a promising path toward task generalization, long-horizon planning, and flexible decision-making. However, prevailing approaches often operate on 2D video sequences and therefore lack the 3D geometric understanding necessary for precise spatial reasoning and physical consistency. We introduce a Structured 4D Latent Predictive Model, which predicts the evolution of a scene's 3D structure in a structured latent space, conditioned on observations and textual instructions. Our representation encodes the scene holistically and can be decoded into diverse 3D formats, enabling a more complete and 3D-consistent scene understanding. The model serves as a planner, generating future scenes that are translated into executable actions by a goal-conditioned inverse dynamics module. Experiments demonstrate that our model generates futures with strong visual quality and substantially better 3D consistency and multi-view coherence than state-of-the-art video-based planners. Consequently, our full planning pipeline achieves superior performance on complex manipulation tasks, exhibits robust generalization to novel visual conditions, and proves effective on real-world robotic platforms. Our website is available at https://structured-4d-model.github.io/.
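The predict-then-act loop described in the abstract can be illustrated with a minimal toy sketch. This is not the authors' implementation: the latent here is a plain list of floats, an explicit goal vector stands in for the textual instruction, and every name (`toy_predictor`, `toy_inverse_dynamics`, `toy_step`, `plan_and_act`) is hypothetical. It only shows the control flow in which a predictor proposes future latents and a goal-conditioned inverse dynamics function turns each one into an action.

```python
from typing import Callable, List, Tuple

# Toy stand-in for a structured scene latent (an assumption, not the paper's format).
State = List[float]


def toy_predictor(latent: State, goal: State, horizon: int) -> List[State]:
    """Propose `horizon` future latents by linear interpolation toward `goal`.

    A learned model would condition on observations and a text instruction;
    here an explicit goal vector plays that role.
    """
    return [
        [z + (g - z) * (t / horizon) for z, g in zip(latent, goal)]
        for t in range(1, horizon + 1)
    ]


def toy_inverse_dynamics(latent: State, subgoal: State) -> State:
    """Goal-conditioned inverse dynamics: the action that moves `latent` to `subgoal`."""
    return [s - z for z, s in zip(latent, subgoal)]


def toy_step(latent: State, action: State) -> State:
    """Toy environment: the action is added directly to the state."""
    return [z + a for z, a in zip(latent, action)]


def plan_and_act(
    latent: State,
    goal: State,
    horizon: int,
    predictor: Callable[[State, State, int], List[State]],
    inverse_dynamics: Callable[[State, State], State],
    step: Callable[[State, State], State],
) -> Tuple[State, List[State]]:
    """Predict future latents once, then track them one action at a time."""
    futures = predictor(latent, goal, horizon)
    actions: List[State] = []
    for subgoal in futures:
        action = inverse_dynamics(latent, subgoal)
        latent = step(latent, action)
        actions.append(action)
    return latent, actions


final_state, actions = plan_and_act(
    [0.0, 0.0, 0.0], [1.0, 2.0, -1.0], 4,
    toy_predictor, toy_inverse_dynamics, toy_step,
)
```

In this toy world the rollout reaches the goal in exactly `horizon` actions. A real system would replace each stand-in with a learned module and would typically replan as new observations arrive.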
