Unified Motion-Action Modeling for Heterogeneous Robot Learning

2026-06-15 • Robotics

AI summary

The authors introduce UMA, a model that links object motion and robot actions by using 3D motion trajectories as a common interface. It learns from heterogeneous data without manually annotated task instructions by predicting masked parts of motion and action sequences and by separating task intent from scene geometry. Once trained, the same model can control robots, predict object dynamics, and adapt to new tasks from a few demonstrations. Pretrained on robot demonstrations, human videos, and simulated data, UMA outperforms existing methods specialized for each of these modes.

visuomotor control, 3D object motion trajectories, masked generative objective, contrastive learning, multi-task pretraining, few-shot learning, dynamics modeling, hindsight relabeling, task adaptation, robot demonstrations

Authors

Yunhao Cao, Shitong Liu, Chao Feng, Meryl Zhang, Xuanchen Lu, Andrew Owens, Kuan Fang

Abstract

We present the Unified Motion-Action (UMA) model, an approach that uses 3D object motion trajectories as a shared interface to bridge visuomotor control and dynamics modeling. UMA treats object motion and robot actions as co-evolving variables under a masked generative objective, in which the mask pattern determines both the supervision regime during pretraining and the inference mode at deployment. Using hindsight-relabeled motion contexts and a contrastive objective that disentangles task intent from scene geometry, UMA enables multi-task pretraining across heterogeneous data sources without requiring manually annotated task instructions. At deployment, the same pretrained parameters support motion-conditioned visuomotor control, motion-based dynamics modeling, and task adaptation from few-shot demonstrations. Pretrained on a mixture of robot demonstrations, human videos, and simulated data, UMA consistently outperforms state-of-the-art baselines specialized for each inference mode.
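
To make the role of the mask pattern concrete, the following is a minimal sketch of how one mask over motion and action tokens could select between supervision regimes and inference modes. It is an illustration under stated assumptions, not the authors' implementation: the names `build_mask` and `supervised_positions`, the mode labels, the per-timestep token layout, and the random masking ratio are all hypothetical, and the abstract does not specify the actual mask patterns. Few-shot task adaptation is not covered here.

```python
import random


def build_mask(mode, horizon, mask_ratio=0.5, seed=None):
    """Return per-timestep masks; True marks a token the model must predict.

    Hypothetical modes (assumptions, not taken from the paper):
      "control":  motion is given as conditioning, actions are predicted
      "dynamics": actions are given as conditioning, motion is predicted
      "pretrain": both streams are masked at random positions
    """
    if mode == "control":
        motion, action = [False] * horizon, [True] * horizon
    elif mode == "dynamics":
        motion, action = [True] * horizon, [False] * horizon
    elif mode == "pretrain":
        rng = random.Random(seed)
        motion = [rng.random() < mask_ratio for _ in range(horizon)]
        action = [rng.random() < mask_ratio for _ in range(horizon)]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return {"motion": motion, "action": action}


def supervised_positions(mask, has_actions=True):
    """List the (stream, timestep) pairs that would contribute to the loss.

    Assumption: action-free sources such as human videos carry no action
    labels, so only their masked motion tokens can be supervised.
    """
    targets = [("motion", t) for t, m in enumerate(mask["motion"]) if m]
    if has_actions:
        targets += [("action", t) for t, m in enumerate(mask["action"]) if m]
    return targets


if __name__ == "__main__":
    print(build_mask("control", 3))
    print(supervised_positions(build_mask("dynamics", 2)))
```

In this sketch, a single set of parameters would be trained against whatever positions `supervised_positions` returns for each data source, and switching the mode at deployment changes only which tokens are treated as given and which are generated.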
