Zero-Shot Semantic Re-Identification for Autonomous Driving: A VLM Baseline Study
2026-06-08 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Machine Learning
AI summary
The authors studied how to re-identify vehicles, pedestrians, and cyclists in self-driving car videos using text descriptions generated by Vision-Language Models rather than image features alone. They used detailed descriptions of attributes such as color, shape, and pose to match objects across different views without any extra training. Their method achieved retrieval performance comparable to a supervised CNN baseline and was easier to interpret because matches rest on explicit attributes. However, the authors found challenges such as descriptions that change with viewing angle and difficulty telling apart very similar-looking objects.
Re-Identification, Vision-Language Models, Zero-shot learning, Autonomous driving, Appearance embeddings, Semantic attributes, Object detection, Visual matching, Interpretability, Cross-view matching
Authors
Eduardo Borges, Manuel Abreu, Luís Garrote, Urbano J. Nunes
Abstract
Re-Identification (ReID) in autonomous driving is typically formulated as a visual matching problem, where observations of vehicles, pedestrians, and cyclists are associated across time, frames, or camera views using learned appearance embeddings, often complemented by motion, geometric, or multimodal cues. However, purely visual representations may be sensitive to viewpoint, occlusion, illumination, and sensor-domain variations, which limits their robustness in complex driving scenes, and they offer little interpretability. We present a baseline study of a zero-shot pipeline that uses Vision-Language Models (VLMs) to generate textual descriptions of detected traffic participants, and we evaluate whether these descriptions can support identity matching across observations. Instead of relying only on low-level visual similarity, the proposed formulation represents each object through structured semantic attributes, including category, color, shape, pose, visible parts, spatial context, and distinctive visual cues. This study provides an initial benchmark for language-based re-identification in autonomous-driving scenarios and evaluates the strengths and limitations of current VLMs for this task. Results demonstrate that zero-shot semantic descriptions can support effective object re-identification, achieving retrieval performance comparable to a supervised CNN baseline while offering greater interpretability through explicit identity cues. However, the experiments also reveal important challenges, including attribute inconsistency across viewpoints and limited fine-grained discrimination between visually similar instances.
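
To make the idea of matching on structured semantic attributes concrete, the following is a minimal sketch and not the authors' pipeline. It assumes each VLM description has already been parsed into a small record; the `Description` class, the `WEIGHTS` values, the `similarity` and `rank_gallery` functions, and the example objects are all hypothetical and chosen only for illustration. Spatial context is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Description:
    """Hypothetical structured description of one detected object."""
    category: str
    color: str
    shape: str
    pose: str
    visible_parts: frozenset = frozenset()
    distinctive_cues: frozenset = frozenset()


# Illustrative weights (assumption, not taken from the paper); they sum to 1.
WEIGHTS = {
    "color": 0.3,
    "shape": 0.2,
    "pose": 0.1,
    "visible_parts": 0.1,
    "distinctive_cues": 0.3,
}


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as a full match."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(q, g):
    """Weighted attribute agreement; different categories never match."""
    if q.category != g.category:
        return 0.0
    return (
        WEIGHTS["color"] * float(q.color == g.color)
        + WEIGHTS["shape"] * float(q.shape == g.shape)
        + WEIGHTS["pose"] * float(q.pose == g.pose)
        + WEIGHTS["visible_parts"] * jaccard(q.visible_parts, g.visible_parts)
        + WEIGHTS["distinctive_cues"] * jaccard(q.distinctive_cues, g.distinctive_cues)
    )


def rank_gallery(query, gallery):
    """Return (gallery_id, score) pairs sorted from best to worst match."""
    scored = [(gid, similarity(query, desc)) for gid, desc in gallery.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: the same red sedan seen from the rear (query) and the side (car_A).
query = Description(
    "car", "red", "sedan", "rear view",
    frozenset({"rear lights", "license plate"}),
    frozenset({"roof rack"}),
)
gallery = {
    "car_A": Description("car", "red", "sedan", "side view",
                         frozenset({"doors", "wheels"}),
                         frozenset({"roof rack"})),
    "car_B": Description("car", "white", "van", "rear view",
                         frozenset({"rear lights", "license plate"}),
                         frozenset()),
    "ped_1": Description("pedestrian", "red", "upright", "walking",
                         frozenset({"torso"}),
                         frozenset({"backpack"})),
}
ranking = rank_gallery(query, gallery)
```

In this toy example `car_A` ranks first because color, shape, and the distinctive cue agree, even though pose and visible parts differ between the rear and side views. That gap is a small-scale illustration of the attribute inconsistency across viewpoints noted in the abstract: viewpoint-dependent attributes contribute nothing to the correct match here, so a practical system would likely need to weight or normalize them differently.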