Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

2026-04-03 · Robotics · Computer Vision and Pattern Recognition
AI summary

The authors introduce MV-VDP, a robotic control method that observes the environment from multiple camera views over time, helping the robot understand 3D space and how the scene changes as it acts. Instead of relying only on flat 2D images, the method predicts both heatmap videos showing where the robot should act and RGB videos showing how the scene is expected to evolve, which makes learning more data-efficient. Tested on the Meta-World benchmark and real robots, it succeeds with as few as ten demonstrations and no additional pretraining, outperforming video-prediction, 3D, and vision-language-action baselines. The predicted future videos also make the robot's decisions easier to interpret.

Keywords
robotic manipulation, multi-view video, diffusion policy, 3D spatial understanding, temporal evolution, heatmap prediction, video prediction, data efficiency, Meta-World benchmark, generalization
Authors
Peiyan Li, Yixiang Chen, Yuan Xu, Jiabing Yang, Xiangnan Wu, Jun Guo, Nan Sun, Long Qian, Xinghang Li, Xin Xiao, Jing Liu, Nianfeng Liu, Tao Kong, Yan Huang, Liang Wang, Tieniu Tan
Abstract
Robotic manipulation requires understanding both the 3D spatial structure of the environment and its temporal evolution, yet most existing policies overlook one or both. They typically rely on 2D visual observations and backbones pretrained on static image-text pairs, resulting in high data requirements and limited understanding of environment dynamics. To address this, we introduce MV-VDP, a multi-view video diffusion policy that jointly models the 3D spatio-temporal state of the environment. The core idea is to simultaneously predict multi-view heatmap videos and RGB videos, which 1) align the representation format of video pretraining with action finetuning, and 2) specify not only what actions the robot should take, but also how the environment is expected to evolve in response to those actions. Extensive experiments show that MV-VDP enables data-efficient, robust, generalizable, and interpretable manipulation. With only ten demonstration trajectories and without additional pretraining, MV-VDP successfully performs complex real-world tasks, demonstrates strong robustness across a range of model hyperparameters, generalizes to out-of-distribution settings, and predicts realistic future videos. Experiments on Meta-World and real-world robotic platforms demonstrate that MV-VDP consistently outperforms video-prediction-based, 3D-based, and vision-language-action models, establishing a new state of the art in data-efficient multi-task manipulation.
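The abstract does not give implementation details, but the joint prediction of multi-view heatmap and RGB videos can be pictured with a short diffusion-style sketch. Everything below is an illustrative assumption rather than the authors' code: the `MultiViewVideoDenoiser` module, the tensor layout (per-view heatmap and RGB channels stacked along the channel axis), and the standard epsilon-prediction training step are placeholders for how such a joint denoiser might be wired up.

```python
# Minimal sketch of a joint multi-view heatmap + RGB video denoising step,
# assuming a DDPM-style epsilon-prediction objective. Shapes, module names,
# and the noise schedule are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class MultiViewVideoDenoiser(nn.Module):
    """Predicts noise for concatenated heatmap + RGB video tensors across views."""

    def __init__(self, views: int = 3):
        super().__init__()
        # Each view contributes a 1-channel heatmap video and a 3-channel RGB
        # video; stacking them lets one network model both jointly.
        in_ch = views * (1 + 3)
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, 64, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(64, in_ch, kernel_size=3, padding=1),
        )

    def forward(self, noisy_videos: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # noisy_videos: (B, views*(1+3), frames, H, W). The timestep t (and, in a
        # real model, language / robot-state conditioning) is omitted for brevity.
        return self.net(noisy_videos)

# One training step: add noise to ground-truth heatmap+RGB videos, then regress
# the injected noise with a mean-squared-error loss.
model = MultiViewVideoDenoiser(views=3)
videos = torch.randn(2, 3 * 4, 8, 32, 32)        # (B, views*(1+3), T, H, W)
t = torch.randint(0, 1000, (2,))
noise = torch.randn_like(videos)
alpha_bar = torch.rand(2).view(-1, 1, 1, 1, 1)    # placeholder noise schedule
noisy = alpha_bar.sqrt() * videos + (1 - alpha_bar).sqrt() * noise
loss = nn.functional.mse_loss(model(noisy, t), noise)
loss.backward()
```

Stacking the per-view heatmap and RGB channels so a single denoiser predicts both is one plausible way to realize the coupling the abstract describes: the heatmap videos encode where the robot should act, while the RGB videos encode how the scene is expected to evolve in response.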