SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving

2026-04-09 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

AI summary

The authors created SearchAD, a large dataset of over 423,000 autonomous driving images built for finding very rare, safety-critical situations. They labeled more than 513,000 objects across 90 rare categories, some of which appear fewer than 50 times in the whole dataset. Unlike earlier benchmarks, which focus on matching specific object instances, SearchAD targets semantic retrieval: finding images by meaning, starting from either a text description or an example image. In the authors' evaluations, text-based methods outperformed image-based ones, and fine-tuning improved results, but even the best methods still perform unsatisfactorily in absolute terms. SearchAD serves as a new resource for improving how self-driving cars detect rare events.

autonomous driving, rare event retrieval, semantic image retrieval, multi-modal retrieval, few-shot learning, bounding boxes, zero-shot learning, dataset annotation, long-tail perception

Authors

Felix Embacher, Jonas Uhrig, Marius Cordts, Markus Enzweiler

Abstract

Retrieving rare and safety-critical driving scenarios from large-scale datasets is essential for building robust autonomous driving (AD) systems. As dataset sizes continue to grow, the key challenge shifts from collecting more data to efficiently identifying the most relevant samples. We introduce SearchAD, a large-scale rare image retrieval dataset for AD containing over 423k frames drawn from 11 established datasets. SearchAD provides high-quality manual annotations of more than 513k bounding boxes covering 90 rare categories. It specifically targets the needle-in-a-haystack problem of locating extremely rare classes, with some appearing fewer than 50 times across the entire dataset. Unlike existing benchmarks, which focus on instance-level retrieval, SearchAD emphasizes semantic image retrieval with a well-defined data split, enabling text-to-image and image-to-image retrieval, few-shot learning, and fine-tuning of multi-modal retrieval models. Comprehensive evaluations show that text-based methods outperform image-based ones due to stronger inherent semantic grounding. While models directly aligning spatial visual features with language achieve the best zero-shot results, and our fine-tuning baseline significantly improves performance, absolute retrieval capabilities remain unsatisfactory. With a held-out test set on a public benchmark server, SearchAD establishes the first large-scale dataset for retrieval-driven data curation and long-tail perception research in AD: https://iis-esslingen.github.io/searchad/
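
To make the retrieval setting concrete, the following is a minimal sketch of embedding-based semantic retrieval: images are ranked by cosine similarity to a query embedding, and the ranking is scored with average precision. Everything here is an illustrative assumption rather than part of SearchAD: the function names, the toy 3-dimensional vectors, the image ids, and the choice of average precision as the metric are all hypothetical, and the abstract does not state which metric the benchmark uses. In practice, the query vector would come from a text or image encoder of a multi-modal model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(query_vec, image_vecs):
    """Return image ids sorted by descending similarity to the query."""
    scores = {img_id: cosine(query_vec, vec) for img_id, vec in image_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


def average_precision(ranking, relevant):
    """Mean of precision@k taken at each rank k where a relevant image appears."""
    hits = 0
    precision_sum = 0.0
    for rank, img_id in enumerate(ranking, start=1):
        if img_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


# Hypothetical toy embeddings; real ones would be produced by an encoder.
image_vecs = {
    "img_a": [0.9, 0.1, 0.0],
    "img_b": [0.1, 0.9, 0.0],
    "img_c": [0.7, 0.3, 0.1],
    "img_d": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.0]     # stands in for an embedded text or image query
relevant = {"img_a", "img_d"}   # images that truly contain the rare class

ranking = rank_images(query_vec, image_vecs)
ap = average_precision(ranking, relevant)
print(ranking, ap)
```

In this toy case one relevant image is ranked first and the other last, which illustrates the needle-in-a-haystack difficulty: a rare positive whose embedding is far from the query lowers the score even when the top result is correct.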
