AI summary
The authors study vision-language models (VLMs) such as CLIP in cross-domain few-shot learning (CDFSL), where a model must adapt to a new domain from only a few labeled examples. They find that fine-tuning on the limited target data pushes the model to rely on easy-to-learn tokens, which worsens the attention sink problem and harms its ability to tell classes apart. To fix this, the authors propose a method that dynamically re-weights tokens during fine-tuning, encouraging the model to learn from harder but more informative tokens. Their approach improves performance on four benchmark datasets, indicating that it helps the model generalize to new domains with little data.
Vision-language models, CLIP, Cross-domain few-shot learning, Attention sink, Fine-tuning, Domain adaptation, Tokens, Discriminability, Dynamic re-weighting
Authors
Shuai Yi, Yixiong Zou, Yuhua Li, Ruixuan Li
Abstract
Vision-language models (VLMs) like CLIP have shown impressive generalization capabilities, yet their potential for Cross-Domain Few-Shot Learning (CDFSL), where the model needs to transfer source-domain information to target domains with scarce training data, remains underexplored. While the attention sink phenomenon has been observed in VLMs for certain tasks, its role in CDFSL scenarios has not been studied. In this paper, we uncover a critical issue overlooked by prior work: standard target-domain few-shot fine-tuning in CDFSL significantly exacerbates the attention sink problem, leading to poor discriminability across classes. Through extensive experiments, we interpret this phenomenon as the model's shortcut learning for domain adaptation: to overcome the huge gap between the source and target domains, the model tends to push tokens that are initially closer to target-domain classes (i.e., simple tokens) even closer to these classes, exacerbating the attention sink and wasting the capacity to learn other discriminative but initially farther tokens (i.e., hard tokens). To address this, we propose a novel approach that dynamically re-weights tokens according to their relevance to target-domain classes during target-domain fine-tuning, which explicitly suppresses the model's reliance on simple tokens and enhances the learning of hard tokens, reducing sink tokens and enhancing discriminability. Extensive experiments on four benchmark datasets validate the rationale of our method and demonstrate new state-of-the-art performance. Our code is available at https://github.com/shuaiyi308/TIR.
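To make the re-weighting idea concrete, the following is a minimal illustrative sketch, not the authors' implementation (for that, see the repository linked above). It assumes that a token's relevance is its highest cosine similarity to any class embedding and that weights come from a softmax over negative relevance. The function names, the `temperature` parameter, and the toy vectors are all hypothetical choices made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def token_relevance(tokens, class_embeddings):
    """Assumed relevance measure: highest cosine similarity to any class embedding."""
    return [max(cosine(t, c) for c in class_embeddings) for t in tokens]

def reweight_tokens(tokens, class_embeddings, temperature=0.5):
    """Return weights that sum to 1 and shrink as relevance grows.

    Assumption for this sketch: a softmax over negative relevance, so tokens
    already close to a class ("simple" tokens) are suppressed and tokens
    farther away ("hard" tokens) are emphasized.
    """
    relevance = token_relevance(tokens, class_embeddings)
    logits = [-r / temperature for r in relevance]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pooled_feature(tokens, weights):
    """Weighted sum of token vectors, usable as an image-level feature."""
    dim = len(tokens[0])
    return [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]

# Toy data (hypothetical): two class embeddings and three visual tokens.
class_embeddings = [[1.0, 0.0], [0.0, 1.0]]
tokens = [
    [0.9, 0.1],  # nearly aligned with class 0: a "simple" token
    [0.5, 0.5],  # equidistant from both classes: the "hardest" token here
    [0.2, 0.3],  # in between
]
weights = reweight_tokens(tokens, class_embeddings)
feature = pooled_feature(tokens, weights)
```

In this toy setup the first token receives the smallest weight and the second the largest, which mirrors the stated goal of suppressing reliance on simple tokens. How relevance is actually measured, and where the weights enter the fine-tuning objective, are details the abstract does not specify.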