VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training
2026-06-03 • Robotics
Robotics, Artificial Intelligence
AI summary
The authors address challenges in training vision-language-action models for robot manipulation from wrist-mounted fisheye camera data and imperfect human demonstrations. They create VISTA, which includes a special question-answering dataset (UMI-VQA) to help models understand distorted fisheye images, a physical-validation process to ensure robot movement data is realistic and safe, and a training method combining these elements. Their experiments show that this approach improves robot control performance in simulations and real-world tasks. They also share their datasets and tools with the community for further research.
Universal Manipulation Interface, Vision-Language-Action models, wrist-mounted fisheye camera, radial distortion, physical-validation, data-completeness, trajectory continuity, self-collision, robot teleoperation, co-training
Authors
Siyuan Yang, Linzheng Guo, Ouyang Lu, Zhaxizhuoma, Daoran Zhang, Xinmiao Wang, Ting Xiao, Fangzheng Yan, Zhijun Chen, Yan Ding, Chao Yu, Chenjia Bai, Xuelong Li
Abstract
Universal Manipulation Interface (UMI) enables scalable real-world robot data collection without hardware-specific teleoperation, yet leveraging UMI data to train large-scale Vision-Language-Action (VLA) models remains fundamentally challenging. We identify two critical mismatches: wrist-mounted fisheye views, with severe radial distortion and local gripper-centric perspectives, are out-of-distribution for pretrained VLMs; and human-collected trajectories frequently violate kinematic limits, incur collisions, or exceed controller bandwidth, teaching VLA policies physically infeasible actions. To address these challenges, we present VISTA, a framework that bridges this dual gap through three synergistic components. (i) UMI-VQA, the first large-scale VQA dataset tailored to wrist-mounted fisheye observations, aligns VLM representations to the distorted visual regime via auxiliary vision-language supervision. (ii) A systematic physical-validation pipeline performs a data-completeness pre-check and scores each valid trajectory for trajectory continuity, self-collision risk, and execution fidelity before it enters training. (iii) A two-stage co-training recipe jointly learns vision-language grounding on UMI-VQA and action prediction on validated trajectories. Our experiments show that incorporating UMI-VQA consistently improves downstream policy performance, and that physical-validation scores are strongly predictive of deployment success. On diverse simulation and real-world manipulation tasks, VISTA significantly outperforms strong baselines including $\pi_{0.5}$, LingBot-VLA, and Wall-X. We release the physical-validation pipeline, UMI-VQA, validated trajectory data, and the pre-trained model for the community.
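The abstract does not say how the pre-check or the scores in component (ii) are computed, so the following is only an illustrative sketch of what a data-completeness pre-check and a trajectory-continuity score might look like. The function names, the choice of end-effector speed as the continuity signal, and the `max_speed` threshold are all assumptions made for this example, not the authors' pipeline.

```python
import numpy as np


def is_complete(timestamps, positions):
    """Hypothetical completeness pre-check (not the authors' definition).

    Accepts a trajectory only if timestamps and positions have matching
    lengths, contain at least two samples, hold only finite values, and
    the timestamps are strictly increasing.
    """
    t = np.asarray(timestamps, dtype=float)
    p = np.asarray(positions, dtype=float)
    if t.ndim != 1 or p.ndim != 2 or len(t) != len(p) or len(t) < 2:
        return False
    if not (np.isfinite(t).all() and np.isfinite(p).all()):
        return False
    return bool(np.all(np.diff(t) > 0))


def continuity_score(timestamps, positions, max_speed=1.0):
    """Hypothetical continuity score in [0, 1].

    Returns the fraction of consecutive steps whose end-effector speed
    stays at or below max_speed (an assumed limit, in m/s).
    """
    t = np.asarray(timestamps, dtype=float)
    p = np.asarray(positions, dtype=float)
    step_lengths = np.linalg.norm(np.diff(p, axis=0), axis=1)
    speeds = step_lengths / np.diff(t)
    return float(np.mean(speeds <= max_speed))


# Toy trajectory: two slow steps followed by one 0.5 m jump in 0.1 s.
timestamps = [0.0, 0.1, 0.2, 0.3]
positions = [[0.0, 0.0, 0.0], [0.01, 0.0, 0.0], [0.02, 0.0, 0.0], [0.52, 0.0, 0.0]]

if is_complete(timestamps, positions):
    print(continuity_score(timestamps, positions))
```

In this toy trajectory, two of the three steps stay under the assumed speed limit, so the score is two thirds. A real pipeline would presumably also cover orientation, joint limits, self-collision, and execution fidelity, none of which this sketch attempts.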