Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning

2026-05-13

Computer Vision and Pattern Recognition
AI summary

The authors study how to improve continual learning by exploiting the detailed patch-level image features that current methods overlook. Their method, SPA, aligns small image patches with descriptive text, helping the model learn discriminative local cues such as a rabbit's ears or tail. This patch-level alignment lets the model recognize objects more accurately and retain old classes over time. They also add task-specific projectors to adapt the model to new tasks and sample pseudo-features from stored class statistics to reduce forgetting of previous knowledge. Their experiments show that this approach outperforms previous methods.

Class-Incremental Learning, CLIP, Vision-Language Models, Patch-Level Features, Semantic Alignment, Optimal Transport, Catastrophic Forgetting, Task-Specific Projectors, Pseudo-Features
Authors
Hao Sun, Zi-Jun Ding, Da-Wei Zhou
Abstract
Class-Incremental Learning (CIL) enables models to continuously integrate new knowledge while mitigating catastrophic forgetting. Driven by CLIP's remarkable generalization, leveraging pre-trained vision-language models has become a dominant paradigm in CIL. However, current methods primarily focus on aligning global image embeddings (i.e., the [CLS] token) with their corresponding text prompts (i.e., the [EOS] token). Despite their strong performance, we find that they discard the rich patch-level semantic information inherent in CLIP's encoders. For instance, when recognizing a rabbit, local patches may encode its distinctive cues, such as long ears and a fluffy tail, which can provide complementary evidence for recognition. Based on this observation, we propose SPA (Semantic-guided Patch-level Alignment) for CLIP-based CIL, which aims to awaken the long-neglected local representations within CLIP. Specifically, for each class we first construct representative and diverse visual samples and feed them to GPT-5 as visual guidance to generate class-wise semantic descriptions. These descriptions guide the selection of discriminative patch-level visual features. Building upon these selected patches, we employ optimal transport to align the selected patch tokens with semantic tokens from the class-wise descriptions, yielding a structured cross-modal alignment that improves recognition. Furthermore, we introduce task-specific projectors for effective adaptation to downstream incremental tasks, and we sample pseudo-features from stored class-wise Gaussian statistics to calibrate old-class representations, thereby mitigating catastrophic forgetting. Extensive experiments demonstrate that SPA achieves state-of-the-art performance.
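The abstract does not spell out the exact formulation, but the description-guided patch selection and the entropic optimal-transport alignment between patch tokens and semantic tokens can be sketched roughly as below. This is a minimal PyTorch sketch under stated assumptions: the function names (select_patches, sinkhorn, patch_text_alignment_loss), the uniform transport marginals, the cosine cost, and hyperparameters such as eps and k are illustrative choices, not the paper's actual ones.

```python
import torch
import torch.nn.functional as F

def select_patches(patch_feats, word_feats, k=16):
    # Keep the k patches whose best cosine similarity to any semantic
    # token is highest (an assumed stand-in for SPA's guided selection).
    p = F.normalize(patch_feats, dim=-1)
    w = F.normalize(word_feats, dim=-1)
    scores = (p @ w.t()).max(dim=1).values      # each patch's best word match
    idx = scores.topk(min(k, p.shape[0])).indices
    return patch_feats[idx]

def sinkhorn(cost, eps=0.1, n_iters=50):
    # Entropic OT with uniform marginals via Sinkhorn iterations.
    # cost: (P, W) matrix between P patch tokens and W semantic tokens.
    P, W = cost.shape
    a = cost.new_full((P,), 1.0 / P)            # uniform mass over patches
    b = cost.new_full((W,), 1.0 / W)            # uniform mass over words
    K = torch.exp(-cost / eps)                  # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.t() @ u)
        u = a / (K @ v)
    return u.unsqueeze(1) * K * v.unsqueeze(0)  # diag(u) @ K @ diag(v)

def patch_text_alignment_loss(patch_feats, word_feats, eps=0.1, k=16):
    # Align selected patch tokens with semantic tokens under the OT plan.
    sel = select_patches(patch_feats, word_feats, k)
    p = F.normalize(sel, dim=-1)
    w = F.normalize(word_feats, dim=-1)
    cost = 1.0 - p @ w.t()                      # cosine cost matrix
    with torch.no_grad():                       # plan treated as a constant here
        plan = sinkhorn(cost, eps=eps)
    return (plan * cost).sum()                  # transport-weighted alignment
```

Whether gradients flow through the transport plan, and how the marginals are weighted, are design choices the abstract leaves open.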
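Likewise, the old-class calibration step, sampling pseudo-features from stored class-wise Gaussian statistics, might look roughly like the following sketch. The class name GaussianFeatureMemory, its interface, and the covariance shrinkage term are all assumptions for illustration.

```python
import torch

class GaussianFeatureMemory:
    # Hypothetical store of per-class feature statistics; samples
    # pseudo-features for old classes while training on a new task.
    def __init__(self):
        self.stats = {}  # class_id -> (mean, covariance)

    def update(self, class_id, feats):
        # feats: (N, D) features of one class from the current task.
        mean = feats.mean(dim=0)
        cov = torch.cov(feats.t())              # (D, D) sample covariance
        cov = cov + 1e-4 * torch.eye(feats.shape[1], device=feats.device)  # keep PD
        self.stats[class_id] = (mean, cov)

    def sample(self, class_id, n):
        # Draw n pseudo-features for an old class to calibrate its classifier.
        mean, cov = self.stats[class_id]
        dist = torch.distributions.MultivariateNormal(mean, covariance_matrix=cov)
        return dist.sample((n,))                # (n, D)
```

Sampled pseudo-features would then be trained alongside new-task features so that old-class decision boundaries stay calibrated without storing raw exemplars.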