LWDrive: Layer-Wise World-Model-Guided Vision-Language Model Planning for Autonomous Driving

2026-06-29 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence

AI summary

The authors present LWDrive, a method that improves how Vision-Language Models (VLMs) plan trajectories for autonomous vehicles. Rather than using the VLM's coarse initial trajectory directly, the approach builds a set of candidate trajectories around it and refines them step by step using features from multiple VLM layers, historical temporal states, and multi-view Bird's-Eye-View features of the current scene. The VLM is also trained to generate future frames, which encourages it to learn forward-looking scene representations. A score head then selects the best refined candidate. The method reports scores of 92.0 on NAVSIM and 89.6 on NAVSIM-v2, and is designed to correct positions and motion trends while preserving the VLM's original driving intention.

Vision-Language Models, End-to-End Autonomous Driving, trajectory planning, world model, Foresight Cascade Planner, future-frame generation, Bird's-Eye-View, multi-view perception, NAVSIM benchmark, coarse-to-fine refinement

Authors

Chen Yang, Yuhao Wei, Ze Xu, Ziheng Zou, Shuang Liang, Delin Ouyang, Lingfeng Qi, Jie Li, Guofa Li

Abstract

Vision-Language Models (VLMs) provide powerful semantic understanding and commonsense reasoning for End-to-End Autonomous Driving (E2E-AD) planning. However, trajectories directly generated by VLMs often encode only coarse driving intentions and remain insufficient for geometrically accurate, future-aware, and multi-view-grounded planning. To address these limitations, we develop the Layer-Wise World-Model-Guided Driving framework (LWDrive). LWDrive is a VLM planning framework that refines coarse trajectories through layer-wise world-model guidance. Instead of treating the VLM output as the final trajectory, LWDrive uses it as an intent-aware coarse plan, expands a diverse candidate space around it, and progressively refines the candidates through a Foresight Cascade Planner (FCP). Specifically, we introduce future-frame generation supervision to encourage the VLM to learn forward-looking scene representations, thereby injecting planning-relevant predictive dynamics into its internal hidden states. Built upon these world-model-supervised representations, FCP exploits VLM features across multiple layers and integrates historical temporal states, Action-Query representations, and current-frame multi-view Bird's-Eye-View (BEV) features to refine candidate trajectories in a coarse-to-fine manner. This design enables progressive correction of spatial positions and motion trends while grounding trajectory refinement with multi-view scene cues and preserving the high-level driving intention produced by the large model. Finally, a score head evaluates the refined candidates and selects the best trajectory as the final planning output. Experiments show that LWDrive achieves a score of 92.0 on the NAVSIM benchmark and 89.6 on NAVSIM-v2. Code and models will be made publicly available.
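The expand, refine, and select flow described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the Gaussian candidate sampling, the hand-written refinement stages, and the distance-based score are all assumptions chosen for illustration. In LWDrive the refinement stages and the score head are learned modules conditioned on multi-layer VLM features, temporal states, Action-Query representations, and BEV features; here a fixed reference trajectory stands in for all of that.

```python
import numpy as np

def expand_candidates(coarse, num_candidates, spread, rng):
    """Sample candidate trajectories around a coarse plan of shape (T, 2).

    Candidate 0 is kept equal to the coarse plan (hypothetical design choice).
    """
    noise = rng.normal(scale=spread, size=(num_candidates, *coarse.shape))
    noise[0] = 0.0
    return coarse[None] + noise

def refine_cascade(candidates, stage_fns):
    """Apply refinement stages in order; each stage returns a residual offset."""
    for stage in stage_fns:
        candidates = candidates + stage(candidates)
    return candidates

def select_best(candidates, score_fn):
    """Score every candidate and return the highest-scoring one with all scores."""
    scores = np.array([score_fn(c) for c in candidates])
    return candidates[int(np.argmax(scores))], scores

# Toy setup (all values are made up for illustration).
rng = np.random.default_rng(0)
coarse = np.stack([np.linspace(0.0, 8.0, 8), np.zeros(8)], axis=1)  # 8 waypoints (x, y)
reference = coarse + np.array([0.0, 0.5])  # stand-in for scene-grounded guidance

def make_stage(step):
    # Stand-in for a learned refinement layer: move a fraction of the way
    # toward the reference trajectory.
    return lambda cands: step * (reference[None] - cands)

def score_fn(traj):
    # Stand-in for a learned score head: negative mean waypoint error.
    return -float(np.mean(np.linalg.norm(traj - reference, axis=1)))

candidates = expand_candidates(coarse, num_candidates=16, spread=0.3, rng=rng)
refined = refine_cascade(candidates, [make_stage(0.5), make_stage(0.5)])
best, scores = select_best(refined, score_fn)
print("coarse score:", score_fn(coarse))
print("best refined score:", score_fn(best))
```

Each stage adds a residual correction to every candidate rather than regenerating the trajectory, which mirrors the coarse-to-fine idea of keeping the coarse intent while adjusting positions progressively.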
