Towards Resolving Optimization Conflicts Between Image- and Text-Based Person Re-Identification
2026-06-01 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition • Artificial Intelligence • Machine Learning
AI summary
The authors study how to better match people using pictures (image-to-image) and descriptions (text-to-image) at the same time. They found that trying to train both tasks together directly causes problems because each task focuses on different details. To solve this, they created a two-step training method using one vision model that learns separately at first to avoid conflicts. Their experiments show that training on images first helps with text matching later, and adding text during training improves both tasks. This approach helps build a shared system for recognizing people across images and text.
person re-identification • image-to-image retrieval • text-to-image retrieval • modality discrepancy • shared representation • vision encoder • cross-modal retrieval • pre-training • loss functions • training pipeline
Authors
Karina Kvanchiani, Timur Mamedov
Abstract
The joint optimization of image-based (I2I) and text-based (T2I) person re-identification (ReID) is hindered by modality discrepancies and conflicting training objectives, leading to suboptimal shared representations. While I2I ReID focuses on identity-level invariance across images of the same person, T2I ReID is driven by instance-specific textual descriptions tied to unique visual traits. This paper explores the fundamental difference between the two ReID tasks and their optimization processes for effective training. Since I2I and T2I ReID are often studied separately, the loss functions optimized for one retrieval setting may negatively affect the representation quality required by the other. Motivated by these findings, we propose a decoupled two-stage training pipeline for learning a shared representation across image and text modalities. The pipeline is based on a single vision encoder that supports both I2I and T2I retrieval while avoiding cross-task interference during training. We provide extensive experiments across multiple configurations, varying domain mixing procedures, learning strategies, and task objectives. We observe that I2I ReID pre-training positively impacts the generalization ability to T2I data. We also find that incorporating textual supervision during the vision encoder training stage enhances both I2I and T2I performance. We believe our insights provide a meaningful step toward unified ReID systems and cross-modal retrieval more broadly.
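
The contrast the abstract draws between identity-level and instance-level supervision can be made concrete with a small sketch. The code below is an illustration only and is not taken from the paper: it assumes a batch in which sample i consists of one image and its own caption, and it assumes that the I2I objective treats all images of the same identity as positives while the T2I objective treats only the paired image as a positive. The function names and the example batch are hypothetical.

```python
def i2i_positive_mask(identity_ids):
    """Identity-level positives (assumed): image i matches image j
    when both carry the same identity label and i != j."""
    n = len(identity_ids)
    return [[i != j and identity_ids[i] == identity_ids[j] for j in range(n)]
            for i in range(n)]


def t2i_positive_mask(n):
    """Instance-level positives (assumed): caption i matches only
    its own paired image i."""
    return [[i == j for j in range(n)] for i in range(n)]


def conflicting_pairs(identity_ids):
    """Return pairs (i, j) that are positives under the identity-level
    mask but are treated as negatives under the instance-level mask."""
    n = len(identity_ids)
    i2i = i2i_positive_mask(identity_ids)
    t2i = t2i_positive_mask(n)
    return [(i, j) for i in range(n) for j in range(n)
            if i2i[i][j] and not t2i[i][j]]


if __name__ == "__main__":
    # Hypothetical batch: samples 0 and 1 show the same person (identity 7).
    batch_identities = [7, 7, 3, 5]
    print(conflicting_pairs(batch_identities))
```

Under these assumptions, any two samples of the same person are pulled together by one objective and pushed apart by the other, which is one plausible reading of the cross-task interference the abstract describes. The actual losses and the details of the two-stage pipeline are those given in the paper, not this sketch.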