[CLS] is Not Enough: Multi-Label Recognition via Patch-Level Inference and Adaptive Aggregation

2026-05-25 | Computer Vision and Pattern Recognition

AI summary

The authors find that vision-language models such as CLIP struggle to recognize multiple objects in one image because they rely on a single global image representation, the [CLS] token. Their method, PIAA, makes predictions for individual image patches and then adaptively aggregates those patch-level scores into the final multi-label prediction. The pipeline requires no training or fine-tuning. In experiments it improves mAP by more than 6% over representative baselines on the NUS-WIDE benchmark while adding minimal extra computation.

Vision-Language Models, CLIP, Multi-label Recognition, Patch-level Inference, Adaptive Aggregation, Semantic Entanglement, Unsupervised Visual Classifier, Training-free Methods, Mean Average Precision (mAP), NUS-WIDE Dataset
Authors
Akang Wang, Xili Deng, Zhanxuan Hu, Yi Zhao, Yonghang Tai, Huafeng Li
Abstract
Vision-Language Models such as CLIP exhibit strong zero-shot recognition capability by aligning images with textual concepts, yet they often underperform on multi-label recognition where multiple objects co-exist. A key bottleneck is that the [CLS] token, as a single global visual representation, is insufficient to faithfully encode diverse targets with varying scales, contexts, and co-occurrence patterns. To address this limitation, we present a new multi-label image recognition framework, termed PIAA, which formulates prediction as Patch-level Inference followed by Adaptive Aggregation. Specifically, we first enhance patch-wise predictions from two complementary perspectives: (i) mitigating semantic entanglement in the visual encoder to obtain more discriminative patch representations, and (ii) learning an unsupervised visual classifier to narrow the vision-language modality gap. We then introduce an adaptive aggregation module that consolidates patch-level scores into the final multi-label prediction. Notably, the entire pipeline is fully training-free, requiring no gradient updates or parameter fine-tuning. Experiments show that our method achieves strong improvements with minimal extra computation, exceeding a 6% mAP gain on the challenging NUS-WIDE benchmark over representative baselines. Code is available at https://github.com/akang-wang/PIAA.
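
To make the "patch-level inference followed by aggregation" idea concrete, the toy sketch below scores each patch embedding against each class text embedding and then pools the patch scores per class. It is an illustration under stated assumptions, not the PIAA implementation: the abstract does not specify the aggregation rule, so a generic temperature-controlled softmax pooling is swapped in, the mean of the patch embeddings stands in for the [CLS] token, and all embeddings, function names, and the temperature value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def patch_scores(patch_embs, text_embs):
    """Return scores[p][c]: similarity of patch p to class prompt c."""
    return [[cosine(p, t) for t in text_embs] for p in patch_embs]


def softmax_pool(values, temperature):
    """Softmax-weighted average of the values.

    A small temperature approaches max pooling; a large one approaches
    mean pooling. This is a generic stand-in, not the paper's module.
    """
    peak = max(values)  # subtract the max for numerical stability
    weights = [math.exp((v - peak) / temperature) for v in values]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def aggregate(scores, temperature=0.1):
    """Pool patch-level scores into one image-level score per class."""
    num_classes = len(scores[0])
    return [
        softmax_pool([row[c] for row in scores], temperature)
        for c in range(num_classes)
    ]


# Hypothetical toy data: 4 patches in a 2-D embedding space.
# Three patches show class 0; a single patch shows class 1.
patch_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
text_embs = [[1.0, 0.0], [0.0, 1.0]]  # one prompt embedding per class

# Assumption: the mean patch embedding plays the role of a global token.
global_emb = [sum(col) / len(patch_embs) for col in zip(*patch_embs)]
global_scores = [cosine(global_emb, t) for t in text_embs]

image_scores = aggregate(patch_scores(patch_embs, text_embs))

print("global:", global_scores)
print("patch-aggregated:", image_scores)
```

In this toy setup the class that occupies a single patch receives a low score from the global stand-in but a high score after patch-level pooling, which mirrors the motivation stated in the abstract. In a real pipeline the patch and text embeddings would come from the vision-language model's encoders rather than hand-written vectors.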