Action Images: End-to-End Policy Learning via Multiview Video Generation

2026-04-07

Computer Vision and Pattern Recognition · Robotics
AI summary

The authors developed a new way for robots to learn tasks by turning their movements into special videos called action images, which show the robot’s arm from different views and link directly to the pixels in the video. This helps the robot use powerful video models to predict future steps without needing separate modules for actions. Their approach works well on both simulated and real robots and improves how the robot plans and understands actions together with videos. Overall, the authors show that using these pixel-based action images makes robot learning more effective and flexible.

world action models, policy learning, video backbone, action representation, pixel grounding, 7-DoF robot actions, multiview video generation, zero-shot policy, RLBench, action-conditioned video generation
Authors
Haoyu Zhen, Zixian Gao, Qiao Sun, Yilin Zhao, Yuncong Yang, Yilun Du, Tsun-Hsuan Wang, Yi-Ling Qiao, Chuang Gan
Abstract
World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model future states. However, existing approaches often rely on separate action modules or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model that formulates policy learning as multiview video generation. Instead of encoding control as low-dimensional tokens, we translate 7-DoF robot actions into interpretable action images: multi-view action videos that are grounded in 2D pixels and explicitly track robot-arm motion. This pixel-grounded action representation allows the video backbone itself to act as a zero-shot policy, without a separate policy head or action module. Beyond control, the same unified model supports video-action joint generation, action-conditioned video generation, and action labeling under a shared representation. On RLBench and real-world evaluations, our model achieves the strongest zero-shot success rates and improves video-action joint generation quality over prior video-space world models, suggesting that interpretable action images are a promising route to policy learning.
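To make the core idea concrete, the sketch below shows one plausible way a 7-DoF action could be rendered as pixel-grounded, per-view frames. It is a minimal illustration, not the authors' implementation: the action layout (xyz + rotation + gripper), the camera parameters, the function names, and the marker encoding are all assumptions introduced here.

```python
# Hypothetical sketch: rasterizing one 7-DoF robot action as per-view
# "action image" frames by projecting the end-effector position into each
# camera and drawing a marker whose intensity encodes the gripper state.
# Action layout, camera models, and marker style are illustrative assumptions.
import numpy as np

def project_to_pixels(p_world, K, world_T_cam):
    """Project a 3D world point into (u, v) pixel coordinates for one camera."""
    cam_T_world = np.linalg.inv(world_T_cam)              # world -> camera
    p_cam = cam_T_world[:3, :3] @ p_world + cam_T_world[:3, 3]
    uvw = K @ p_cam                                       # pinhole projection
    return uvw[:2] / uvw[2]                               # assumes point is in front of camera

def action_to_images(action, cameras, hw=(256, 256), radius=4):
    """Render one action as single-channel action-image frames, one per view.

    action  : 7-DoF vector, assumed here to be [x, y, z, rx, ry, rz, gripper].
    cameras : list of (K, world_T_cam) pairs, one per camera view.
    Returns : list of HxW float arrays with a disc marker at the projected
              end-effector location; gripper openness sets the intensity.
    """
    h, w = hw
    p_world, grip = action[:3], action[6]
    frames = []
    for K, world_T_cam in cameras:
        frame = np.zeros((h, w), dtype=np.float32)
        u, v = project_to_pixels(p_world, K, world_T_cam)
        uu, vv = int(round(u)), int(round(v))
        if 0 <= uu < w and 0 <= vv < h:
            ys, xs = np.ogrid[:h, :w]                     # pixel coordinate grids
            mask = (xs - uu) ** 2 + (ys - vv) ** 2 <= radius ** 2
            frame[mask] = 0.5 + 0.5 * grip                # gripper state as brightness
        frames.append(frame)
    return frames
```

Stacking such frames along a trajectory would yield a multi-view action video in the same pixel space as the RGB observations, which is the property the abstract attributes to action images: the video backbone can generate actions and observations under one shared representation.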