Beyond Self-Play and Scale: A Behavior Benchmark for Generalization in Autonomous Driving

2026-05-11 · Robotics

AI summary

The authors introduce BehaviorBench, a test suite for evaluating driving policies trained at scale with reinforcement learning (RL) on standardized benchmarks. They connect the PufferDrive simulator to the established nuPlan benchmark and find that existing benchmarks are simple enough to be solved by basic lane following, so they extract a harder, interaction-rich split from the Waymo Open Motion Dataset that requires genuine multi-agent reasoning. They also evaluate driving policies against a diverse suite of traffic behaviors instead of the single rule-based model most benchmarks rely on. Their results show that policies trained purely by self-play overfit to their training opponents and struggle against unfamiliar traffic behaviors, so the authors propose combining an RL policy with a rule-based planner, as sketched below.
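
As a rough illustration of that hybrid idea, a minimal gating sketch might look like the following. All names here (`ppo_policy`, `rule_planner`, `safety_check`) are hypothetical placeholders; the paper's actual gating logic is not specified on this page.

```python
def hybrid_plan(obs, ppo_policy, rule_planner, safety_check):
    """Hypothetical hybrid planner: prefer the learned action and
    fall back to a rule-based planner when the action looks unsafe.

    ppo_policy, rule_planner, and safety_check are illustrative
    placeholders, not the paper's implementation.
    """
    action = ppo_policy(obs)            # action proposed by the PPO policy
    if not safety_check(obs, action):   # e.g. predicted collision or off-road
        action = rule_planner(obs)      # conservative rule-based fallback
    return action
```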

Reinforcement Learning · Autonomous Driving · Benchmarking · nuPlan · Waymo Open Motion Dataset · Multi-agent reasoning · Self-play · Intelligent Driver Model · PPO (Proximal Policy Optimization) · Rule-based planner
Authors
Aron Distelzweig, Faris Janjoš, Andreas Look, Anna Rothenhäusler, Daniel Jost, Oliver Scheel, Raghu Rajan, Daphne Cornelisse, Eugene Vinitsky, Joschka Boedecker
Abstract
Recent Autonomous Driving (AD) works such as GigaFlow and PufferDrive have unlocked Reinforcement Learning (RL) at scale as a training strategy for driving policies. Yet such policies remain disconnected from established benchmarks, leaving the performance of large-scale RL for driving on standardized evaluations unknown. We present BehaviorBench -- a comprehensive test suite that closes this gap along three axes: Evaluation, Complexity, and Behavior Diversity. In terms of Evaluation, we provide an interface connecting PufferDrive to nuPlan, which, for the first time, enables policies trained via RL at scale to be evaluated on an established planning benchmark for autonomous driving. Complementarily, we offer an evaluation framework that allows planners to be benchmarked directly inside the PufferDrive simulation, in a fraction of the time. Regarding Complexity, we observe that today's standardized benchmarks are so simple that near-perfect scores are achievable by straight lane following with collision checking. We extract a meaningful, interaction-rich split from the Waymo Open Motion Dataset (WOMD) on which strong performance is impossible without multi-agent reasoning. Lastly, we address Behavior Diversity. Existing benchmarks commonly evaluate planners against a single rule-based traffic model, the Intelligent Driver Model (IDM). We provide a diverse suite of interactive traffic agents to stress-test policies under heterogeneous behaviors, beyond just IDM. Overall, our benchmarking analysis uncovers the following insight: despite learning interactive behaviors in an emergent manner, policies trained via pure self-play under standard reward functions overfit to their training opponents and fail to generalize to other traffic agent behaviors. Building on this observation, we propose a hybrid planner that combines a PPO policy with a rule-based planner.
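
For readers unfamiliar with it, the Intelligent Driver Model (IDM) referenced in the abstract is a standard rule-based car-following model. The sketch below shows its acceleration law in Python; the parameter values are common textbook defaults, not the benchmark's actual configuration.

```python
import math

def idm_acceleration(v, gap, dv,
                     v0=15.0,      # desired speed [m/s]
                     T=1.5,        # desired time headway [s]
                     s0=2.0,       # minimum standstill gap [m]
                     a_max=1.5,    # maximum acceleration [m/s^2]
                     b=2.0,        # comfortable deceleration [m/s^2]
                     delta=4.0):   # free-road acceleration exponent
    """Intelligent Driver Model: longitudinal acceleration of the ego vehicle.

    v   -- current ego speed [m/s]
    gap -- bumper-to-bumper distance to the lead vehicle [m]
    dv  -- approach rate v - v_lead [m/s]
    Parameter defaults are illustrative textbook values.
    """
    # Desired dynamic gap: standstill gap + headway term + braking term.
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    # Free-road term pulls toward the desired speed; the squared gap
    # ratio brakes the vehicle as it closes in on a leader.
    return a_max * (1.0 - (v / v0) ** delta - (s_star / max(gap, 1e-6)) ** 2)
```

The interaction term makes IDM collision-avoidant but also deterministic and behaviorally narrow, which is exactly why evaluating only against IDM, as the abstract argues, gives an incomplete picture of a policy's generalization.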