Beyond Pedestrians: Caption-Guided CLIP Framework for High-Difficulty Video-based Person Re-Identification

2026-04-09 · Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors developed a new method called CG-CLIP to better identify people in videos, especially during challenging activities like sports or dance, where many people look similar and move a lot. They use captions generated by multi-modal large language models to help focus on the distinguishing features of each person, and they aggregate information across video frames with a small, fixed set of learnable tokens to keep computation low. Their approach was tested on two standard datasets (MARS and iLIDS-VID) and two newly built, more difficult ones (SportsVReID and DanceVReID), where it outperformed existing methods. This helps improve how computers match people across different cameras in difficult scenes.

Person Re-Identification, Spatiotemporal Features, CLIP, Caption-guided Memory Refinement, Token-based Feature Extraction, Multi-modal Large Language Models, Cross-attention Mechanism, MARS Dataset, iLIDS-VID Dataset, Video Analysis
Authors
Shogo Hamano, Shunya Wakasugi, Tatsuhito Sato, Sayaka Nakamura
Abstract
In recent years, video-based person Re-Identification (ReID) has gained attention for its ability to leverage spatiotemporal cues to match individuals across non-overlapping cameras. However, current methods struggle with high-difficulty scenarios, such as sports and dance performances, where multiple individuals wear similar clothing while performing dynamic movements. To overcome these challenges, we propose CG-CLIP, a novel caption-guided CLIP framework that leverages explicit textual descriptions and learnable tokens. Our method introduces two key components: Caption-guided Memory Refinement (CMR) and Token-based Feature Extraction (TFE). CMR utilizes captions generated by Multi-modal Large Language Models (MLLMs) to refine identity-specific features, capturing fine-grained details. TFE employs a cross-attention mechanism with fixed-length learnable tokens to efficiently aggregate spatiotemporal features, reducing computational overhead. We evaluate our approach on two standard datasets (MARS and iLIDS-VID) and two newly constructed high-difficulty datasets (SportsVReID and DanceVReID). Experimental results demonstrate that our method outperforms current state-of-the-art approaches, achieving significant improvements across all benchmarks.
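The abstract describes TFE only at a high level, so the following is a minimal sketch of the general idea it names: a fixed number of learnable tokens act as queries in a cross-attention over all frame-level features. It is not the authors' implementation. The function name `token_cross_attention`, the single-head formulation, the projection matrices, and every size (`T`, `N`, `D`, `K`) are assumptions chosen for illustration, and random arrays stand in for CLIP features and trained parameters.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def token_cross_attention(tokens, features, w_q, w_k, w_v):
    """Single-head cross-attention (hypothetical helper, not from the paper).

    tokens:   (K, D) fixed-length learnable query tokens
    features: (T*N, D) patch features from all T frames, flattened
    Returns the (K, D) aggregated output and the (K, T*N) attention map.
    """
    q = tokens @ w_q                             # (K, D)
    k = features @ w_k                           # (T*N, D)
    v = features @ w_v                           # (T*N, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (K, T*N)
    attn = softmax(scores, axis=-1)              # each token's weights sum to 1
    return attn @ v, attn


# Assumed sizes: 8 frames, 129 features per frame, width 64, 4 tokens.
T, N, D, K = 8, 129, 64, 4
rng = np.random.default_rng(0)
features = rng.standard_normal((T * N, D))       # stand-in for frame features
tokens = rng.standard_normal((K, D)) * 0.02      # would be trained in practice
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

out, attn = token_cross_attention(tokens, features, w_q, w_k, w_v)
print(out.shape)   # (4, 64): size depends on K and D, not on T or N
```

In this sketch the attention map has K × (T·N) entries, whereas full self-attention over the same features would need (T·N)² entries. That is one plausible reading of the "reducing computational overhead" claim, under the assumptions stated above.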