Density-Aware Translation of Spurious Correlations in Zero-Shot VLMs
2026-06-01 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Machine Learning
AI summary
The authors address a problem with vision-language models like CLIP, which sometimes rely too much on background clues instead of the actual object in an image. They found that the way these models represent images and text causes common features to cluster tightly while rare but important details scatter, leading to biased predictions. To fix this, the authors propose a method called Density-Aware Translation that adjusts similarity scores based on how densely data points cluster, reducing overconfidence in misleading matches. Their experiments show this approach improves accuracy and makes predictions more reliable without needing extra fine-tuning.
Vision-Language Models, CLIP, Zero-shot Classification, Spurious Correlations, Embedding Space, Anisotropy, Similarity Scores, Feature Density, Prompt Engineering
Authors
Afsaneh Hasanebrahimi, Hanxun Huang, Christopher Leckie, Sarah Erfani
Abstract
Vision-language models (VLMs), such as CLIP, achieve powerful zero-shot classification. However, their predictions remain sensitive to spurious correlations, where contextual cues dominate over semantic content. Earlier solutions typically rely on either fine-tuning, which undermines the advantages of pre-trained models, or prompt engineering, which is prone to hallucination. In this work, we propose Density-Aware Translation (DAT), which refines image-text similarity scores using a local geometric density term derived from group reference sets. Our approach is motivated by the observation that CLIP embeddings exhibit a modality gap and lie on an anisotropic shell in the feature space: common patterns cluster near the mean, while rare patterns are pushed outward. This geometry creates uneven alignment, in which spurious correlations are amplified while semantically meaningful but rare cues are marginalised. To address this, we employ a relative measure to rescale similarities based on embedding density, suppressing overconfident scores in diffuse regions while preserving dense, semantically consistent matches. Experimental results on benchmark datasets demonstrate consistent improvements in both worst-group and average accuracy, highlighting density-aware translation as a simple and effective calibration mechanism for reliable zero-shot classification with multimodal models.
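The abstract does not give the DAT formula, so the following is only a minimal sketch of the general idea of rescaling image-text similarities by a relative local density term. Everything in it is an assumption made for illustration: the function names (`local_density`, `density_aware_scores`), the use of mean top-k cosine similarity as the density estimate, the choice of one reference set per class, and the multiplicative rescaling rule. It should not be read as the authors' method.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def local_density(query, reference_set, k=2):
    """Hypothetical density estimate: mean cosine similarity between the
    query and its k most similar embeddings in the reference set."""
    if not reference_set:
        raise ValueError("reference_set must not be empty")
    sims = sorted((cosine(query, r) for r in reference_set), reverse=True)
    top = sims[:k]
    return sum(top) / len(top)


def density_aware_scores(image_emb, text_embs, reference_sets, k=2):
    """Rescale raw image-text similarities by a relative density weight.

    Assumptions (not from the paper): one reference set per class, raw
    similarities are positive, and the weight for a class is its density
    divided by the sum of densities over all classes.
    """
    raw = [cosine(image_emb, t) for t in text_embs]
    # Map densities from [-1, 1] to [0, 1] so the weights are non-negative.
    dens = [(local_density(image_emb, refs, k) + 1.0) / 2.0
            for refs in reference_sets]
    total = sum(dens)
    if total == 0.0:
        weights = [1.0 / len(dens)] * len(dens)  # fall back to uniform
    else:
        weights = [d / total for d in dens]
    return [s * w for s, w in zip(raw, weights)]


# Toy 2-D example: the raw similarity slightly favours class 1, but the
# image sits in a dense region of class 0's reference set.
image = [1.0, 0.0]
texts = [[0.8, 0.6], [0.9, 0.4]]
refs = [
    [[1.0, 0.1], [1.0, -0.1], [0.9, 0.2]],   # class 0: tight around the image
    [[0.0, 1.0], [0.3, 1.0], [-0.5, 1.0]],   # class 1: far from the image
]
raw_scores = [cosine(image, t) for t in texts]
adjusted = density_aware_scores(image, texts, refs, k=2)
```

In this toy setup the raw scores rank class 1 first, while the rescaled scores rank class 0 first, because the image lies close to class 0's reference embeddings and far from class 1's. A real pipeline would replace the 2-D lists with normalised CLIP image and text embeddings.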