AI summary
The authors address the problem that large vision-language models such as CLIP perform inconsistently on specialised or underrepresented image categories, especially when labelled examples are scarce. They propose a method that needs just one labelled image per class: a language model generates challenging alternative (counterfactual) descriptions, and the vision-language model is tested on how well it distinguishes the correct description from them. From these measurements they train a simple model that predicts how accurate the vision-language model would be on new image tasks, without requiring a large labelled test set. The approach works well across diverse datasets, including ones from underrepresented regions, helping users decide whether further labelling is worth the effort before committing to it. The code and data are released to support further research.
Keywords
Vision-Language Foundation Models, CLIP, Zero-shot accuracy, Large Language Models, Counterfactual descriptions, Embedding space, Linear regression, Data annotation, Global South datasets, Domain adaptation
Authors
Chris Vorster, Mayug Maniparambil, Noel E. O'Connor, Noel Murphy, Derek Molloy
Abstract
Large-scale Vision-Language Foundation Models (VLFMs), such as CLIP, now underpin a wide range of computer vision research and applications. VLFMs are often adapted to various domain-specific tasks. However, VLFM performance on novel, specialised, or underrepresented domains remains inconsistent. Evaluating VLFMs typically requires labelled test sets, which are often unavailable for niche domains of interest, particularly those from the Global South. We address this gap by proposing a highly data-efficient method to predict a VLFM's zero-shot accuracy on a target domain using only a single labelled image per class. Our approach uses a Large Language Model to generate plausible counterfactual descriptions of a given image. By measuring the VLFM's ability to distinguish the correct description from these hard negatives, we engineer features that capture the VLFM's discriminative power in its shared embedding space. A linear regressor trained on these similarity scores estimates the VLFM's zero-shot test accuracy across various visual domains with a Pearson-r correlation of 0.96. We demonstrate our method's performance across five diverse datasets, including standard benchmark datasets and underrepresented datasets from Africa. Our work provides a low-cost, reliable tool for probing VLFMs, enabling researchers and practitioners to make informed decisions about data annotation efforts before committing significant resources. The model training code, generated captions and counterfactuals are released here: https://github.com/chris-vorster/PreLabellingProbe.
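The abstract's pipeline (score an image against its correct caption and LLM-generated counterfactuals, engineer features from the similarity margins, then fit a linear regressor to predict zero-shot accuracy) can be sketched in a few lines. The following is a minimal illustration with *simulated* similarity scores standing in for real CLIP embeddings; the feature choices (`probe_features`), the number of domains, and all constants are assumptions for demonstration, not the authors' released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def probe_features(sims_correct, sims_negatives):
    # Illustrative feature engineering from similarity scores:
    # the margin between the correct caption and the hardest counterfactual,
    # summarised as its mean and the fraction of outright wins.
    margin = sims_correct - sims_negatives.max(axis=1)
    return np.array([margin.mean(), (margin > 0).mean()])

# Simulate probe results for several "domains": one labelled image per class,
# each scored against its true caption and k counterfactual captions.
features, accuracies = [], []
for _ in range(40):
    difficulty = rng.uniform(0.0, 1.0)  # latent domain hardness (simulated)
    n_classes, k = 20, 5
    sims_correct = rng.normal(0.30 - 0.10 * difficulty, 0.02, n_classes)
    sims_neg = rng.normal(0.20 + 0.05 * difficulty, 0.02, (n_classes, k))
    features.append(probe_features(sims_correct, sims_neg))
    accuracies.append(0.9 - 0.5 * difficulty + rng.normal(0, 0.02))

# Linear regressor: similarity-derived features -> zero-shot accuracy.
X = np.column_stack([np.ones(len(features)), np.array(features)])
y = np.array(accuracies)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ w
r = np.corrcoef(pred, y)[0, 1]  # Pearson-r between predicted and true accuracy
print(f"Pearson r = {r:.2f}")
```

In the real method the similarity scores would come from the VLFM's shared embedding space (image embedding vs. caption embeddings), and the regressor would be fit across labelled benchmark domains, then applied to a new domain using only one labelled image per class.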