Is Your Trajectory Displacement Safe in Long-tail?

2026-06-15 · Robotics

Robotics, Artificial Intelligence
AI summary

The authors address the challenge of evaluating self-driving car planning in rare, tricky situations that are hard to test. They treat evaluation like spotting new safety problems introduced by a car’s planned path compared to an expert’s. Their method, FluidTest, uses a clear human review system, a detailed list of potential safety threats, and multiple layers of checks to ensure accuracy. Tests showed that even top self-driving systems can still make unsafe moves, which simpler scores might miss. This approach helps better find and understand risks in autonomous driving plans.

autonomous driving, planning evaluation, long-tail scenarios, safety threats, human annotation, verification system, trajectory displacement, Average Displacement Error, WOD-E2E dataset, closed-loop metrics
Authors
Qiao Sun, Weicheng Zheng, Yixin Huang, Hang Zhao
Abstract
Long-tail scenarios remain a major bottleneck for autonomous driving evaluation, even as datasets grow by orders of magnitude. Existing evaluation pipelines are rarely human-aligned, safety-aware, verifiable, and explainable at the same time: closed-loop metrics often saturate among strong planners, while unstructured human ratings can be noisy without a carefully designed protocol. We formulate planning evaluation as additional-threat detection: given a planner trajectory and an expert reference, does the planner's displacement introduce new unsafe driving behavior? We propose FluidTest, an evaluation pipeline with three components: a pairwise WebUI protocol for reliable human annotation; a taxonomy of 32 semantic threats with evidence-grounded decision graphs; and a three-agent verification system with reflection for precision and auditability. Experiments on the WOD-E2E dataset show that FluidTest produces consistent labels among trained annotators and identifies additional threats in 65% of Poutine trajectories and 51% of RAP trajectories. These results show that state-of-the-art planners can still exhibit substantial safety-relevant failures despite high Rater Feedback Scores (RFS) and low Average Displacement Error (ADE). Additional details, guidance, and code are available at https://fluidtest.web.app.
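To make the contrast between displacement metrics and additional-threat detection concrete, here is a minimal sketch. It is not FluidTest code. It assumes ADE is the mean Euclidean distance between time-aligned 2D waypoints, and it models "additional threats" as a plain set difference between threat labels assigned to the planner trajectory and to the expert reference. The function names and the threat labels (`crosses_lane_boundary`, `close_to_pedestrian`) are hypothetical and are not taken from the paper's 32-threat taxonomy.

```python
import math


def average_displacement_error(planned, expert):
    """Mean Euclidean distance between time-aligned (x, y) waypoints."""
    if not planned or len(planned) != len(expert):
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(p, e) for p, e in zip(planned, expert)) / len(planned)


def additional_threats(planner_threats, expert_threats):
    """Threats present for the planner but absent from the expert reference."""
    return set(planner_threats) - set(expert_threats)


# Hypothetical example: the planner drifts slightly to the left of the expert.
expert = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
planned = [(0.0, 0.0), (5.0, 0.3), (10.0, 0.6)]
ade = average_displacement_error(planned, expert)

# Hypothetical labels, as an annotator or verifier might assign them.
expert_threats = {"close_to_pedestrian"}
planner_threats = {"close_to_pedestrian", "crosses_lane_boundary"}
new_threats = additional_threats(planner_threats, expert_threats)

print(f"ADE = {ade:.2f} m, additional threats = {sorted(new_threats)}")
```

In this toy case the displacement is small, yet the planner still picks up a threat that the expert trajectory does not have. That is the kind of failure the abstract says ADE and RFS can miss.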