Fail2Drive: Benchmarking Closed-Loop Driving Generalization
2026-04-09 • Robotics
Robotics • Computer Vision and Pattern Recognition
AI summary
The authors highlight that self-driving systems often struggle in situations they were not trained on, and that existing benchmarks in simulators like CARLA rarely expose this because they reuse training scenarios at test time. They created Fail2Drive, a benchmark of paired driving routes that lets researchers see exactly how a change in appearance, layout, or behavior affects performance. Testing current models showed an average success-rate drop of 22.8% under these shifts, revealing unexpected mistakes such as ignoring objects that are clearly visible in the LiDAR. They also provide open-source tools for creating new test scenarios and checking that each one is solvable. This work aims to help build more reliable self-driving systems by measuring and understanding their generalization more precisely.
Closed-loop autonomous driving, Distribution shift, CARLA simulator, Generalization benchmark, Scenario testing, LiDAR perception, Free and occupied space, Success rate, Open-source tools, Paired-route benchmark
Authors
Simon Gerstenecker, Andreas Geiger, Katrin Renz
Abstract
Generalization under distribution shift remains a central bottleneck for closed-loop autonomous driving. Although simulators like CARLA enable safe and scalable testing, existing benchmarks rarely measure true generalization: they typically reuse training scenarios at test time. Success can therefore reflect memorization rather than robust driving behavior. We introduce Fail2Drive, the first paired-route benchmark for closed-loop generalization in CARLA, with 200 routes and 17 new scenario classes spanning appearance, layout, behavioral, and robustness shifts. Each shifted route is matched with an in-distribution counterpart, isolating the effect of the shift and turning qualitative failures into quantitative diagnostics. Evaluating multiple state-of-the-art models reveals consistent degradation, with an average success-rate drop of 22.8%. Our analysis uncovers unexpected failure modes, such as ignoring objects clearly visible in the LiDAR and failing to learn the fundamental concepts of free and occupied space. To accelerate follow-up work, Fail2Drive includes an open-source toolbox for creating new scenarios and validating solvability via a privileged expert policy. Together, these components establish a reproducible foundation for benchmarking and improving closed-loop driving generalization. We open-source all code, data, and tools at https://github.com/autonomousvision/fail2drive.
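The paired-route idea can be illustrated with a small sketch of how a success-rate drop might be computed from matched routes. This is not the Fail2Drive toolbox: the `RoutePair` record, the function names, and the toy results are hypothetical, and the sketch assumes the drop is the absolute difference in percentage points between in-distribution and shifted success rates, which the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutePair:
    """Hypothetical record: one in-distribution route and its shifted twin."""
    scenario_class: str
    base_success: bool     # outcome on the in-distribution route
    shifted_success: bool  # outcome on the matched shifted route


def success_rate(flags):
    """Percentage of successful routes; 0.0 for an empty input."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def paired_drop(pairs):
    """Success-rate drop in percentage points (assumed: base minus shifted)."""
    base = success_rate(p.base_success for p in pairs)
    shifted = success_rate(p.shifted_success for p in pairs)
    return base - shifted


def drop_by_class(pairs):
    """Group pairs by scenario class and compute the drop for each group."""
    grouped = {}
    for p in pairs:
        grouped.setdefault(p.scenario_class, []).append(p)
    return {name: paired_drop(group) for name, group in grouped.items()}


# Toy outcomes for illustration only, not results from the paper.
toy_pairs = [
    RoutePair("appearance", True, True),
    RoutePair("appearance", True, False),
    RoutePair("layout", True, False),
    RoutePair("layout", False, False),
]

if __name__ == "__main__":
    print(f"overall drop: {paired_drop(toy_pairs):.1f} points")
    for name, drop in drop_by_class(toy_pairs).items():
        print(f"{name}: {drop:.1f} points")
```

Because each shifted route shares everything but the shift with its counterpart, a per-class drop of this kind attributes the lost successes to the shift itself rather than to differences in route difficulty.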