Driving Video Retrieval for Complex Queries with Structured Grounding
2026-06-08 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Information Retrieval, Machine Learning
AI summary
The authors address the challenge of finding specific dynamic events, such as cut-ins and hard braking, in autonomous driving videos. They note that vision-language and keyword-based methods often miss these events because the relevant motion is not always described in text, while rule-based methods are brittle when their assumptions do not match real driving data. Their system, STRIVE-D, uses weakly labeled in-domain videos to estimate when a rule is reliable, adapt rules that mismatch the data, and combine the calibrated rule scores with the other retrieval signals. Across three driving benchmarks, the approach delivers up to 84% relative improvement in top-1 accuracy over previous methods.
video retrieval, autonomous driving, vision-language models, rule-based retrieval, weak supervision, event detection, data calibration, DrivingDojo dataset
Authors
Manyi Yao, Sparsh Garg, Christian Shelton, Amit Roy-Chowdhury, Abhishek Aich
Abstract
Video retrieval at scale is central to data curation and safety validation in autonomous driving, where users want to find not only scenes but also dynamic events such as cut-ins and hard braking. Existing vision-language and keyword-based retrieval methods often miss these events because the relevant motion may not be explicitly described in text or captured by lexical overlap. Rule-based retrieval can encode such events more directly, but it is brittle: generated or hand-written rules often fail when their assumptions do not match real driving data. We propose STRIVE-D, a data-calibrated retrieval framework for driving videos. It uses weakly labeled in-domain videos to estimate when a query rule is reliable, adapt rules that mismatch observed data, and fuse calibrated rule scores with vision-language and keyword-based retrieval signals. Across three driving benchmarks, including newly released human-annotated event data on DrivingDojo, STRIVE-D delivers up to 84% relative improvement in top-1 accuracy over state-of-the-art methods.
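The abstract does not specify how reliability is estimated or how the signals are fused, so the following is only a minimal illustrative sketch of the general idea, not the STRIVE-D implementation. It assumes that a rule's reliability is approximated by its precision on weakly labeled videos and that the fused score is a reliability-weighted blend of the rule score with the average of the vision-language and keyword scores. All names (`Candidate`, `estimate_rule_reliability`, `fuse`, `rank`) and the specific weighting are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One video with per-signal relevance scores in [0, 1] (hypothetical schema)."""
    video_id: str
    rule_score: float
    vlm_score: float
    keyword_score: float


def estimate_rule_reliability(rule_hits, weak_labels):
    """Assumed proxy for reliability: precision of the rule on weakly labeled videos.

    rule_hits[i] is True if the rule fired on video i; weak_labels[i] is 1 if the
    weak label marks video i as containing the queried event, else 0.
    """
    fired = [label for hit, label in zip(rule_hits, weak_labels) if hit]
    if not fired:
        return 0.0  # a rule that never fires gets no trust
    return sum(fired) / len(fired)


def fuse(candidate, reliability):
    """Blend the rule score with the other signals, weighted by rule reliability."""
    fallback = 0.5 * (candidate.vlm_score + candidate.keyword_score)
    return reliability * candidate.rule_score + (1.0 - reliability) * fallback


def rank(candidates, reliability):
    """Return candidates sorted by fused score, best first."""
    return sorted(candidates, key=lambda c: fuse(c, reliability), reverse=True)


if __name__ == "__main__":
    reliability = estimate_rule_reliability(
        rule_hits=[True, True, False, True],
        weak_labels=[1, 0, 0, 1],
    )
    pool = [
        Candidate("clip_a", rule_score=0.9, vlm_score=0.2, keyword_score=0.1),
        Candidate("clip_b", rule_score=0.1, vlm_score=0.8, keyword_score=0.7),
    ]
    for c in rank(pool, reliability):
        print(c.video_id, round(fuse(c, reliability), 3))
```

Under these assumptions, a rule that agrees with the weak labels dominates the ranking, while an unreliable rule is effectively ignored in favor of the vision-language and keyword signals. The rule-adaptation step described in the abstract is not modeled here.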