Towards Generalizable Robotic Manipulation in Dynamic Environments

2026-03-16

Computer Vision and Pattern Recognition · Robotics
AI summary

The authors identify that current vision-language-action (VLA) models work well on static objects but struggle with tasks involving moving objects, because they mostly rely on single-frame observations and lack dynamic training data. To address this, they created DOMINO, a large-scale dataset with many tasks and expert demonstrations focused on dynamic manipulation. They also propose PUMA, a new model that uses past motion information to predict near-future object states, improving performance on these dynamic tasks. Their experiments show that training on dynamic data helps not only with moving targets but also with static tasks.

Vision-Language-Action models · Dynamic manipulation · Spatiotemporal reasoning · Optical flow · Dataset benchmark · Expert trajectories · Short-horizon prediction · Generalizability · Hierarchical task complexity · Transfer learning
Authors
Heng Fang, Shangru Li, Shuhan Wang, Xuanyang Xi, Dingkang Liang, Xiang Bai
Abstract
Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks. All code and data are available at https://github.com/H-EmbodVis/DOMINO.
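The core idea in the abstract, coupling history-aware perception with short-horizon prediction of object states, can be illustrated with a minimal toy sketch. This is not the PUMA architecture (which uses learned optical-flow features and world queries); it is only a constant-velocity analogue, assuming the object's recent 2D track stands in for the motion cues that historical optical flow would provide:

```python
import numpy as np

def predict_future_state(history, horizon=1):
    """Toy short-horizon predictor: estimate average frame-to-frame motion
    from a history of 2D object positions (a stand-in for the motion cues
    that historical optical flow carries) and extrapolate `horizon` steps
    ahead. A constant-velocity sketch, not the paper's learned forecaster."""
    history = np.asarray(history, dtype=float)        # shape (T, 2)
    velocity = np.diff(history, axis=0).mean(axis=0)  # mean per-frame displacement
    return history[-1] + horizon * velocity           # extrapolated future position

# An object moving +1.0 in x and +0.5 in y per frame:
track = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]
print(predict_future_state(track, horizon=2))  # extrapolates 2 frames ahead
```

A learned model like PUMA would replace the hand-coded velocity estimate with representations trained on dynamic data, but the interface is the same: past observations in, near-future object state out.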