VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

2026-04-23 · Robotics

AI summary

The authors present VistaBot, a system that lets robots manipulate objects reliably even when the camera viewpoint changes, without requiring camera adjustment at test time. The method combines geometric understanding with video models to predict actions from different views. Evaluated on robots in both simulation and real-world tasks, VistaBot perceives and acts more reliably from unseen camera angles. The authors also introduce a new score for measuring how well a system generalizes across camera viewpoints.

robotic manipulation · camera viewpoint · 4D geometry · view synthesis · video diffusion models · closed-loop control · latent action learning · cross-view generalization · action-chunking · benchmark metrics
Authors
Songen Gu, Yuhang Zheng, Weize Li, Yupeng Zheng, Yating Feng, Xiang Li, Yilun Chen, Pengfei Li, Wenchao Ding
Abstract
Recently, end-to-end robotic manipulation models have gained significant attention for their generalizability and scalability. However, they often suffer from limited robustness to camera viewpoint changes when trained with a fixed camera. In this paper, we propose VistaBot, a novel framework that integrates feed-forward geometric models with video diffusion models to achieve view-robust closed-loop manipulation without requiring camera calibration at test time. Our approach consists of three key components: 4D geometry estimation, view-synthesis latent extraction, and latent action learning. VistaBot is integrated into both action-chunking (ACT) and diffusion-based ($π_0$) policies and evaluated across simulation and real-world tasks. We further introduce the View Generalization Score (VGS) as a new metric for comprehensive evaluation of cross-view generalization. Results show that VistaBot improves VGS by 2.79$\times$ and 2.63$\times$ over ACT and $π_0$, respectively, while also achieving high-quality novel view synthesis. Our contributions include a geometry-aware synthesis model, a latent action planner, a new benchmark metric, and extensive validation across diverse environments. The code and models will be made publicly available.
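The abstract names the View Generalization Score (VGS) but does not give its formula. A minimal hedged sketch, assuming (purely for illustration) that VGS aggregates per-viewpoint task success rates into a single number by averaging over a set of held-out camera views:

```python
# Hypothetical sketch of a VGS-style metric. The paper only names the
# View Generalization Score; the aggregation below (a simple mean of
# success rates over evaluation viewpoints) is an assumption, not the
# authors' definition.

def view_generalization_score(success_rates):
    """success_rates: mapping viewpoint name -> task success rate in [0, 1]."""
    if not success_rates:
        raise ValueError("need at least one evaluation viewpoint")
    return sum(success_rates.values()) / len(success_rates)

# Example with made-up numbers: success drops as the view departs
# from the training camera pose.
rates = {"front": 0.90, "left_30deg": 0.55, "top_down": 0.35}
print(round(view_generalization_score(rates), 3))  # 0.6
```

Under this assumed definition, a reported improvement of 2.79× would mean the averaged cross-view success rate nearly tripled relative to the ACT baseline.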