GraphBEV++: Multi-Modal Feature Alignment for Autonomous Driving

2026-06-15 • Computer Vision and Pattern Recognition

AI summary

The authors address a problem in self-driving cars: camera and LiDAR sensors are never perfectly calibrated against each other, so their features don’t line up when fused into a shared bird's-eye-view (BEV) representation. They created GraphBEV++, a system that reduces this misalignment with two modules: LocalAlign-v2 for local corrections and GlobalAlign-v2 for global ones. Their method works with both LSS-based and query-based BEV representations and improves the car’s accuracy in detecting objects and estimating 3D occupancy under both clean and noisy conditions. They tested it on nuScenes, a Waymo subset, Argoverse2, Bench2Drive, and NAVSIM, and report better results than existing methods in driving-related tasks like perception, prediction, and planning.

BEV perception, LiDAR-camera calibration, multi-modal fusion, graph matching, deformable alignment, diffusion denoising, 3D occupancy prediction, autonomous driving, nuScenes dataset, end-to-end driving systems

Authors

Ziying Song, Caiyan Jia, Lin Liu, Shaoqing Xu, Lei Yang, Yadan Luo

Abstract

Feature misalignment in BEV perception is a critical yet often overlooked challenge in autonomous driving, especially under calibration uncertainties between LiDAR and camera sensors. To address this issue, we propose a robust multi-modal fusion framework, GraphBEV++, which systematically mitigates projection-induced misalignment. The framework consists of two key modules: LocalAlign-v2 and GlobalAlign-v2. LocalAlign-v2 introduces neighborhood-aware depth features via graph matching to correct local misalignment. It supports both LSS-based and query-based BEV representations, making it compatible with BEVFusion and BEVFormer architectures for consistent cross-paradigm alignment. GlobalAlign-v2 encompasses two variants: Deformable and Diffusion. The Deformable variant addresses global misalignment in LSS-based multi-modal BEV by explicitly learning cross-modal feature offsets. In contrast, the Diffusion variant targets implicit misalignment in query-based BEV by injecting noise to simulate misalignment and employing a denoising process to recover aligned features. Experimental results show that GraphBEV++ achieves state-of-the-art performance under misalignment noise on nuScenes and a Waymo subset, improves long-range detection on Argoverse2, and generalizes effectively to the 3D occupancy prediction task, consistently improving occupancy estimation accuracy and robustness under both clean and noisy settings. Furthermore, GraphBEV++ effectively alleviates misalignment issues in end-to-end autonomous driving. Compared with five baselines (UniAD, VAD, FusionAD, MomAD, and WoTE), it demonstrates superior performance in both open-loop (nuScenes) and closed-loop (Bench2Drive and NAVSIM) evaluations across perception, prediction, and planning tasks.
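
To make "projection-induced misalignment" concrete, the following is a minimal NumPy sketch, not the authors' implementation. It projects synthetic LiDAR-like points through a pinhole camera twice, once with the correct pose and once with a small rotational calibration error, and measures how far each projected point moves in the image. It then gathers, for each projected point, the depths of its nearest neighbours in pixel space, as a rough stand-in for the idea of neighbourhood-aware depth. All function names, the intrinsic matrix, the 0.5 degree error, and the brute-force nearest-neighbour search are assumptions chosen for illustration; the abstract does not specify any of them.

```python
import numpy as np

def rotation_y(angle_rad):
    """Rotation about the camera y-axis (a yaw error when z points forward)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(points_cam, intrinsics):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixels and (N,) depths."""
    depths = points_cam[:, 2]
    uvw = points_cam @ intrinsics.T
    return uvw[:, :2] / depths[:, None], depths

def knn_depths(pixels, depths, k):
    """Depths of the k nearest points in pixel space (each point is its own first neighbour)."""
    sq_dist = ((pixels[:, None, :] - pixels[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(sq_dist, axis=1)[:, :k]
    return depths[nearest]

# Hypothetical intrinsics: 1000 px focal length, principal point at (800, 450).
intrinsics = np.array([[1000.0, 0.0, 800.0],
                       [0.0, 1000.0, 450.0],
                       [0.0, 0.0, 1.0]])

# Synthetic points in the camera frame: x right, y down, z forward (metres).
rng = np.random.default_rng(0)
num_points = 200
points = np.column_stack([
    rng.uniform(-5.0, 5.0, num_points),
    rng.uniform(-1.0, 1.0, num_points),
    rng.uniform(10.0, 50.0, num_points),
])

# Simulate a calibration error: the assumed camera pose is off by 0.5 degrees.
rotation_error = rotation_y(np.deg2rad(0.5))
points_noisy = points @ rotation_error.T

pixels_clean, depths = project(points, intrinsics)
pixels_noisy, _ = project(points_noisy, intrinsics)

# Per-point displacement in the image caused purely by the calibration error.
pixel_shift = np.linalg.norm(pixels_noisy - pixels_clean, axis=1)

# Neighbourhood depth context: 8 nearest projected points per point.
neighbour_depths = knn_depths(pixels_clean, depths, k=8)

print("mean pixel shift:", pixel_shift.mean())
print("neighbour depth array shape:", neighbour_depths.shape)
```

Under these assumed values, a sub-degree rotation error already moves every projected point by several pixels, which is enough to pair a LiDAR depth with the wrong image feature near object boundaries. Looking at a neighbourhood of depths rather than a single projected depth is one simple way to tolerate such shifts; the graph matching described in the abstract is presumably more involved than this nearest-neighbour lookup.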
