World Pilot: Steering Vision-Language-Action Models with World-Action Priors

2026-06-10 • Robotics

AI summary

The authors introduce World Pilot, a Vision-Language-Action (VLA) framework for robotic manipulation. Earlier VLA models take their semantic grounding from static image-text pairs, which do not capture how a scene changes during continuous, contact-rich manipulation. World Pilot adds priors from a World-Action Model (WAM) through two pathways: Latent Steering, which conditions perception on a latent describing how the scene is expected to evolve, and Action Steering, which gives the action generator an anticipated trajectory as a motion prior. The authors report an 84.7% Total success rate on the LIBERO-Plus zero-shot OOD benchmark and the highest success rate in every real-robot setting across four manipulation tasks, with the largest margins under shifts in viewpoint, geometry, deformable state, and pose.

Vision-Language-Action (VLA) models, Semantic grounding, World-Action Model (WAM), Latent Steering, Action Steering, Scene evolution, Zero-shot out-of-distribution (OOD), Robotic manipulation, Trajectory planning, Benchmark evaluation

Authors

Zefu Lin, Rongxu Cui, Junjia Xu, Xiaojuan Jin, Wenling Li, Lue Fan, Zhaoxiang Zhang

Abstract

Vision-Language-Action (VLA) models inherit semantic grounding from large-scale pretraining and perform competently across in-distribution manipulation tasks. This grounding, however, is built on static image-text pairs, whereas manipulation is a continuous, contact-rich process whose dynamics this pretraining cannot capture. We present World Pilot, a VLA framework that augments the policy with priors from a World-Action Model (WAM), routed into the decision chain through two complementary pathways. Latent Steering conditions the perception layer on a scene-evolution latent, and Action Steering supplies an anticipated trajectory as a motion prior to the action generator. Together the two priors equip the VLA with an anticipated view of the scene and a trajectory-level motion hint alongside its semantic conditioning, and the scene-evolution prior remains effective even when supplied by a video-pretrained world model that has not been action-post-trained. World Pilot attains a state-of-the-art Total success rate of 84.7% on the LIBERO-Plus zero-shot OOD benchmark and the highest success rate on every real-robot setting across four manipulation tasks, with the largest margins under shifts in viewpoint, geometry, deformable state, and pose. Project Website: https://world-pilot.github.io/
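
The two pathways named in the abstract can be pictured with a toy data-flow sketch. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how the priors are injected, so the additive projection used for Latent Steering, the linear head used for the action generator, and every name, dimension, and weight below are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes are assumptions chosen for illustration only.
N_TOKENS, D_FEAT = 16, 32   # visual tokens and feature width
D_LATENT = 8                # scene-evolution latent size
HORIZON, D_ACT = 10, 7      # action chunk length and action dimensions


def latent_steering(visual_tokens, scene_latent, w_latent):
    """Condition perception features on the scene-evolution latent.

    Assumed mechanism: project the latent to feature width and add it
    to every visual token (broadcast over the token axis).
    """
    return visual_tokens + scene_latent @ w_latent


def action_generator(steered_tokens, prior_traj, w_feat, w_prior):
    """Toy action generator that receives the anticipated trajectory as input.

    Assumed mechanism: a linear head over mean-pooled features plus a
    linear map of the flattened prior trajectory.
    """
    pooled = steered_tokens.mean(axis=0)
    out = pooled @ w_feat + prior_traj.reshape(-1) @ w_prior
    return out.reshape(HORIZON, D_ACT)


# Random stand-ins for the VLA's visual features and the WAM's two outputs.
visual_tokens = rng.normal(size=(N_TOKENS, D_FEAT))
scene_latent = rng.normal(size=D_LATENT)
anticipated_traj = rng.normal(size=(HORIZON, D_ACT))

# Randomly initialised weights; a real system would learn these.
w_latent = rng.normal(size=(D_LATENT, D_FEAT)) * 0.1
w_feat = rng.normal(size=(D_FEAT, HORIZON * D_ACT)) * 0.1
w_prior = rng.normal(size=(HORIZON * D_ACT, HORIZON * D_ACT)) * 0.1

steered = latent_steering(visual_tokens, scene_latent, w_latent)
actions = action_generator(steered, anticipated_traj, w_feat, w_prior)
print(steered.shape, actions.shape)  # (16, 32) (10, 7)
```

The sketch only shows where each prior enters the decision chain: the scene-evolution latent changes what the perception features look like, while the anticipated trajectory is an extra input to the module that emits the action chunk.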
