MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation

2026-06-08 • Robotics

AI summary

The authors developed MotionWAM, a new approach for controlling humanoid robots in real time using video-based models. Unlike previous methods that separated upper and lower body control, their model predicts whole-body motions together from a single egocentric camera input. They trained the system in stages to adapt to the robot's view and movements, resulting in better performance on various tasks than older baseline methods. Their work shows it is possible to use video-based action models beyond simple tabletop tasks for more complex, coordinated robot control.

World Action Models, video dynamics prior, egocentric camera, humanoid locomotion, policy conditioning, motion latent, whole-body control, iterative denoising, Unitree G1, Vision-Language-Action

Authors

Jia Zheng, Teli Ma, Yudong Fan, Zifan Wang, Shuo Yang, Junwei Liang

Abstract

World Action Models (WAMs) couple a video dynamics prior to the policy and have shown encouraging results on tabletop manipulation, but iterative denoising over high-dimensional video-action latents leaves them too slow for real-time humanoid loco-manipulation. The problem is compounded by the dominant hierarchical paradigm, in which a high-level manipulation policy controls only the upper body while a low-level controller tracks coarse base commands -- placing upper and lower body in inconsistent action spaces and reducing the legs to balance-preserving locomotion. We present MotionWAM, a real-time WAM that drives autonomous humanoid loco-manipulation from a single egocentric camera by conditioning the policy on the intermediate denoising features of a video world model. MotionWAM replaces the upper-lower split with a unified motion latent and predicts whole-body motion tokens that jointly cover locomotion, torso motion, height regulation, foot interaction, and hand manipulation in a single action space. A three-stage learning framework progressively adapts the video world model to egocentric visual dynamics and to the target humanoid embodiment. On nine real-world Unitree G1 tasks, MotionWAM runs in real time, substantially outperforms Vision-Language-Action (VLA) baselines fine-tuned on the same demonstrations by over 30% in overall success rate, and executes task-driven foot interaction that decoupled upper-lower policies cannot reach. Our results suggest that video-pretrained WAMs can be lifted from tabletop manipulation to coordinated, human-like whole-body humanoid control.
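The two ideas in the abstract, reading policy features from an intermediate denoising step and emitting one whole-body action vector, can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the authors' implementation: the denoiser and action head are placeholder arithmetic, the step counts are arbitrary, and the per-component sizes in `LAYOUT` are invented. Only the five component names come from the abstract. Stopping the denoising schedule early is one plausible reading of "intermediate denoising features", and the abstract does not confirm it.

```python
import random

# Hypothetical sizes for each part of the unified action space.
# The five names follow the abstract; the dimensions are made up.
LAYOUT = {"locomotion": 3, "torso": 3, "height": 1, "foot": 2, "hand": 4}
ACTION_DIM = sum(LAYOUT.values())

def toy_denoise_step(latent, step, total_steps):
    """Placeholder for one learned denoising update of a video latent."""
    scale = 1.0 - 1.0 / (total_steps - step + 1)
    new_latent = [scale * x for x in latent]
    features = list(new_latent)  # stand-in for the model's hidden features
    return new_latent, features

def intermediate_features(latent, num_steps, total_steps):
    """Run only the first num_steps updates and return the latest features."""
    features = list(latent)
    for step in range(num_steps):
        latent, features = toy_denoise_step(latent, step, total_steps)
    return features

def toy_action_head(features, action_dim):
    """Placeholder read-out from features to one flat whole-body vector."""
    mean = sum(features) / len(features)
    return [mean * (i + 1) for i in range(action_dim)]

def split_action(vector, layout):
    """Slice the flat vector into named body-part commands."""
    parts, start = {}, 0
    for name, size in layout.items():
        parts[name] = vector[start:start + size]
        start += size
    return parts

random.seed(0)
noisy_latent = [random.gauss(0.0, 1.0) for _ in range(8)]
TOTAL_STEPS = 10  # assumed length of the full denoising schedule
EARLY_STEPS = 2   # assumed number of steps run before reading features

features = intermediate_features(noisy_latent, EARLY_STEPS, TOTAL_STEPS)
action = toy_action_head(features, ACTION_DIM)
parts = split_action(action, LAYOUT)
```

In this sketch the legs and hands are slices of the same predicted vector rather than outputs of separate high-level and low-level policies, which is the contrast with the hierarchical upper-lower split that the abstract draws.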
