Latent-WAM: Latent World Action Modeling for End-to-End Autonomous Driving

2026-03-25

Computer Vision and Pattern Recognition · Robotics
AI summary

The authors developed Latent-WAM, a system for self-driving cars that plans better routes by using smarter ways to understand and predict the surroundings. They created two main parts: one that turns multiple camera views into simple, useful tokens that capture the scene’s geometry, and another that uses a Transformer model to predict how the world will change over time based on past visuals and movements. Their approach uses less data and a smaller model but still performs better than previous methods on popular driving simulations. This shows their system helps cars plan paths more accurately while being efficient.

autonomous driving, trajectory planning, latent world representations, spatial encoding, Transformer, world model, multi-view images, scene tokens, autoregressive prediction, NAVSIM
Authors
Linbo Wang, Yupeng Zheng, Qiang Chen, Shiwei Li, Yichen Zhang, Zebin Xing, Qichao Zhang, Xiang Li, Deheng Qian, Pengxuan Yang, Yihang Dong, Ce Hao, Xiaoqing Ye, Junyu Han, Yifeng Pan, Dongbin Zhao
Abstract
We introduce Latent-WAM, an efficient end-to-end autonomous driving framework that achieves strong trajectory planning through spatially-aware and dynamics-informed latent world representations. Existing world-model-based planners suffer from inadequately compressed representations, limited spatial understanding, and underutilized temporal dynamics, resulting in sub-optimal planning under constrained data and compute budgets. Latent-WAM addresses these limitations with two core modules: a Spatial-Aware Compressive World Encoder (SCWE) that distills geometric knowledge from a foundation model and compresses multi-view images into compact scene tokens via learnable queries, and a Dynamic Latent World Model (DLWM) that employs a causal Transformer to autoregressively predict future world status conditioned on historical visual and motion representations. Extensive experiments on NAVSIM v2 and HUGSIM demonstrate new state-of-the-art results: 89.3 EPDMS on NAVSIM v2 and 28.9 HD-Score on HUGSIM, surpassing the best prior perception-free method by 3.2 EPDMS with significantly less training data and a compact 104M-parameter model.
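The abstract describes two modules: a compressor that turns multi-view image features into compact scene tokens via learnable queries, and a causal Transformer that autoregressively predicts future latent world states conditioned on past visual and motion representations. The paper does not release an implementation here, so the following is a minimal, hypothetical PyTorch sketch of that two-stage design; all class names, dimensions, and the (x, y, yaw) motion encoding are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class SceneTokenCompressor(nn.Module):
    """Hypothetical SCWE-style module: learnable queries cross-attend to
    flattened multi-view image features, producing compact scene tokens."""
    def __init__(self, dim=256, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, view_feats):
        # view_feats: (B, N_views * H * W, dim) flattened camera features
        B = view_feats.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        tokens, _ = self.cross_attn(q, view_feats, view_feats)
        return tokens  # (B, num_queries, dim): compressed scene tokens

class LatentWorldModel(nn.Module):
    """Hypothetical DLWM-style module: a causal Transformer over a history of
    latent world states fused with motion embeddings; each step can only
    attend to the past, enabling autoregressive future prediction."""
    def __init__(self, dim=256, num_heads=8, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, num_heads, dim * 4,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)
        self.motion_proj = nn.Linear(3, dim)  # assumed (x, y, yaw) per step

    def forward(self, states, motions):
        # states: (B, T, dim) latent world states; motions: (B, T, 3)
        x = states + self.motion_proj(motions)
        T = x.shape[1]
        # boolean causal mask: True = attention disallowed (future positions)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        return self.backbone(x, mask=causal)  # (B, T, dim)

# Shape check with dummy inputs: 3 camera views with 16x16 feature maps,
# compressed per frame, then rolled out over a 4-frame history.
compressor = SceneTokenCompressor()
feats = torch.randn(2, 3 * 16 * 16, 256)
tokens = compressor(feats)                    # (2, 64, 256)
state = tokens.mean(dim=1, keepdim=True)      # pool to one latent per frame
history = state.expand(-1, 4, -1)             # stand-in 4-frame history
pred = LatentWorldModel()(history, torch.randn(2, 4, 3))
```

The key design points the sketch mirrors are the fixed, small query budget (compression is decided by `num_queries`, not image resolution) and the causal mask, which is what lets the world model be trained in parallel yet queried autoregressively at inference.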