Intend, Reflect, Refine: An Adaptive Multimodal Reflection Framework for Autonomous Driving

2026-06-22

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors present IRR-Drive (Intend, Reflect, Refine), a planning framework for autonomous driving that examines the likely consequences of a plan before committing to it. The model first states a preliminary intention in text and predicts how the scene will evolve as a semantic bird's-eye view (BEV) map, then uses this reflection to correct its intention before generating the final trajectory. It also adapts how much reasoning it performs to the complexity of the scene. The authors report state-of-the-art results on the NAVSIM benchmark in both PDMS and EPDMS, and experiments indicating that the reflection process improves the resulting plans.

Vision-Language-Action models, trajectory planning, bird's-eye view (BEV), multimodal reflection, autonomous driving, semantic prediction, adaptive reasoning, NAVSIM benchmark, planning framework, self-correction
Authors
Zisheng Chen, Yuping Qiu, Jianhua Han, Tao Tang, Xiuwei Chen, Likui Zhang, Ying-Cong Chen, Hang Xu, Xiaodan Liang
Abstract
Recent Vision-Language-Action (VLA) models have advanced end-to-end autonomous driving by incorporating reasoning for better interpretability and planning quality. However, most existing approaches directly generate the final trajectory without explicitly examining its future consequences, which limits their reliability in complex and dynamic environments. To address this limitation, we propose IRR-Drive (Intend, Reflect, Refine), an adaptive multimodal reflection framework for autonomous driving. Specifically, to tightly couple high-level reasoning with physical constraints, IRR-Drive first generates a preliminary textual intention and anticipates potential interactions by predicting future semantic bird's-eye view (BEV) representations. This dual-modality (Text + BEV) reflection space explicitly models anticipated scene evolution, enabling the model to rigorously self-correct and refine its initial intent before generating the final trajectory. Furthermore, to balance planning performance and computational efficiency, we construct reflection-oriented training data and design an adaptive reflection reward, enabling the model to adaptively select its reasoning mode according to scene complexity. Instead of using reasoning primarily as an auxiliary interpretation, IRR-Drive directly integrates an adaptive reflection mechanism into the planning framework, enabling grounded, decision-aware trajectory correction that is driven by scene complexity. Our method achieves state-of-the-art performance on the NAVSIM benchmark in both PDMS and EPDMS. Extensive experiments demonstrate the effectiveness of our multimodal reflection framework and validate the efficacy of the proposed adaptive reflection strategy.
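
The control flow described in the abstract (intend, optionally reflect over text plus a predicted BEV, then refine and decode a trajectory) can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (`plan`, `PlanResult`, and the `toy_*` stand-ins) is hypothetical, the scene and BEV types are toy placeholders, and the rule that decides when to reflect is a hand-written threshold, whereas the abstract describes a model that learns to select its reasoning mode through an adaptive reflection reward.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Scene = Dict[str, int]                  # toy scene, e.g. {"agents_ahead": 3}
BEV = List[List[int]]                   # toy semantic grid, 1 = occupied cell
Trajectory = List[Tuple[float, float]]  # (x, y) waypoints


@dataclass
class PlanResult:
    intention: str          # final textual intention
    reflected: bool         # whether the reflection branch was taken
    trajectory: Trajectory  # waypoints decoded from the final intention


def plan(
    scene: Scene,
    intend: Callable[[Scene], str],
    needs_reflection: Callable[[Scene, str], bool],
    predict_bev: Callable[[Scene, str], BEV],
    refine: Callable[[str, BEV], str],
    decode: Callable[[str], Trajectory],
) -> PlanResult:
    # Intend: produce a preliminary textual intention.
    intention = intend(scene)
    # Adaptive mode selection (assumption: an external predicate stands in
    # for the model's learned choice of reasoning mode).
    if not needs_reflection(scene, intention):
        return PlanResult(intention, False, decode(intention))
    # Reflect: anticipate scene evolution as a future semantic BEV.
    future_bev = predict_bev(scene, intention)
    # Refine: correct the intention using the predicted BEV, then decode.
    refined = refine(intention, future_bev)
    return PlanResult(refined, True, decode(refined))


# --- Toy stand-ins, purely illustrative ---
def toy_intend(scene: Scene) -> str:
    return "keep lane"


def toy_needs_reflection(scene: Scene, intention: str) -> bool:
    # Hand-written complexity threshold, not a learned policy.
    return scene["agents_ahead"] >= 2


def toy_predict_bev(scene: Scene, intention: str) -> BEV:
    # One-row strip of the ego lane; the first n cells are occupied.
    n = min(scene["agents_ahead"], 4)
    return [[1] * n + [0] * (4 - n)]


def toy_refine(intention: str, bev: BEV) -> str:
    occupied = any(cell for row in bev for cell in row)
    return "slow down" if occupied else intention


def toy_decode(intention: str) -> Trajectory:
    step = 1.0 if intention == "slow down" else 2.0
    return [(step * i, 0.0) for i in range(1, 4)]


def toy_plan(scene: Scene) -> PlanResult:
    return plan(scene, toy_intend, toy_needs_reflection,
                toy_predict_bev, toy_refine, toy_decode)


if __name__ == "__main__":
    print(toy_plan({"agents_ahead": 0}))  # direct planning, no reflection
    print(toy_plan({"agents_ahead": 3}))  # reflection branch is taken
```

In this sketch a scene with no agents ahead skips reflection and keeps the original intention, while a crowded scene pays for the BEV prediction and ends with a revised intention and a shorter trajectory. Passing the stages as callables keeps the control flow separate from any particular model, which is a choice made for illustration only.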