MTA-RL: Robust Urban Driving via Multi-modal Transformer-based 3D Affordances and Reinforcement Learning

2026-05-11

Computer Vision and Pattern Recognition, Artificial Intelligence, Robotics
AI summary

The authors propose a new method called MTA-RL to improve self-driving cars' scene understanding and decision-making. Instead of predicting actions directly from raw images and LiDAR data, their method uses a transformer to combine these inputs into explicit 3D representations called affordances. These affordances help the reinforcement learning system learn faster and make safer driving decisions. Testing in the CARLA driving simulator showed that their approach outperforms existing methods, even in towns the system was not trained on. They also found that combining different sensors and designing rewards carefully are important for strong performance.

Autonomous driving, 3D scene understanding, Reinforcement learning, Transformer architecture, Multi-modal fusion, Affordance representation, LiDAR, RGB images, Zero-shot generalization, CARLA simulation
Authors
Guangli Chen, Dianzhao Li, Wenjian Zhong, Bangquan Xie, Ostap Okhrin
Abstract
Robust urban autonomous driving requires reliable 3D scene understanding and stable decision-making under dense interactions. However, existing end-to-end models lack interpretability, while modular pipelines suffer from error propagation across brittle interfaces. This paper proposes MTA-RL, the first framework that bridges perception and control through Multi-modal Transformer-based 3D Affordances and Reinforcement Learning (RL). Unlike previous fusion models that directly regress actions, MTA-RL fuses RGB images and LiDAR point clouds with a transformer architecture to predict explicit, geometry-aware affordance representations. These structured representations serve as a compact observation space, enabling the RL policy to operate purely on predicted driving semantics, which significantly improves sample efficiency and stability. Extensive evaluations in CARLA Town01-03 across varying traffic densities (20-60 background vehicles) show that MTA-RL consistently outperforms state-of-the-art baselines. Trained solely on Town03, our method demonstrates superior zero-shot generalization in unseen towns, achieving up to a 9.0% increase in Route Completion, an 11.0% increase in Total Distance, and an 83.7% improvement in Distance Per Violation. Furthermore, ablation studies confirm that multi-modal fusion and reward shaping are critical: the full model significantly outperforms image-only and reward-unshaped variants, demonstrating the effectiveness of MTA-RL for robust urban autonomous driving.
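
To make the pipeline concrete, below is a minimal sketch of the idea described in the abstract: modality tokens from RGB and LiDAR encoders are fused by a transformer, a compact affordance vector is regressed, and an RL policy observes only that vector. The module names, dimensions, and the specific affordance set are illustrative assumptions, not the paper's exact architecture.

# Minimal PyTorch sketch (hypothetical names and dimensions): transformer
# fusion of image and LiDAR tokens -> compact affordance vector -> RL policy.
import torch
import torch.nn as nn

class AffordanceFusion(nn.Module):
    def __init__(self, d_model=256, n_affordances=8):
        super().__init__()
        # Assumes upstream encoders (e.g., a CNN for RGB, a point-cloud
        # encoder for LiDAR) already emit token sequences of size d_model.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)
        self.affordance_head = nn.Linear(d_model, n_affordances)

    def forward(self, img_tokens, lidar_tokens):
        # Concatenate modality tokens and let self-attention fuse them.
        tokens = torch.cat([img_tokens, lidar_tokens], dim=1)
        fused = self.fusion(tokens)
        # Pool to one vector and regress geometry-aware affordances
        # (e.g., lane offset, heading error, distance to lead vehicle).
        return self.affordance_head(fused.mean(dim=1))

class Policy(nn.Module):
    """RL policy that observes only the predicted affordance vector."""
    def __init__(self, n_affordances=8, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_affordances, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions), nn.Tanh())

    def forward(self, affordances):
        return self.net(affordances)  # e.g., steering and throttle/brake

# Usage: a batch of 2 scenes with 64 image tokens and 128 LiDAR tokens each.
fusion, policy = AffordanceFusion(), Policy()
affordances = fusion(torch.randn(2, 64, 256), torch.randn(2, 128, 256))
action = policy(affordances)
print(affordances.shape, action.shape)  # torch.Size([2, 8]) torch.Size([2, 2])

Keeping the policy's observation space restricted to the low-dimensional affordance vector, rather than raw sensor features, is what the abstract credits for the gains in sample efficiency and training stability.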