CMAP: Cross-Modal Adaptive Prompting for Multi-Domain Task-Incremental Learning

2026-05-25 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Computation and Language, Emerging Technologies

AI summary

The authors study how to teach a model to learn many tasks from different types of images one after another without forgetting earlier ones, even when it doesn't know which task it's doing. They improve on existing methods by using both the image features and the text descriptions from a popular vision-language model called CLIP. Their approach uses the text space to help decide the task, measures confidence more reliably using multiple prototypes, and adjusts both the image and text parts of the model so they stay aligned. The method performs better than prior work on a large benchmark with many tasks while training only 2.5M parameters.

multi-domain learning, task-incremental learning, CLIP, vision-language models, task routing, cross-modal alignment, parameter-efficient learning, prototype modeling, Gumbel gates, zero parameter cost

Authors

Sriram Mandalika

Abstract

Multi-domain task-incremental learning requires a model to sequentially acquire knowledge across visually diverse domains without forgetting prior tasks, and without access to task identity at inference. Parameter-efficient methods built on frozen vision-language models have made strong progress, yet all existing approaches rely exclusively on visual features for task routing, confidence estimation, and encoder adaptation, leaving CLIP's cross-modal text embedding space entirely unexploited. We address this gap through three contributions. Text-space task routing replaces visual Gaussian matching with cosine similarity to frozen CLIP text prototypes, giving order-independent routing robust to data scarcity at zero parameter cost. Multi-prototype visual-textual confidence replaces single-Gaussian class modeling with K-means visual prototypes and cross-modal alignment scores under task-calibrated thresholds. Symmetric cross-modal gating extends per-layer Gumbel gates to the text encoder conditioned on batch image features, preserving cross-modal alignment on out-of-distribution inputs. On the MTIL benchmark spanning 11 datasets and 1201 classes, our method achieves 74.2% Transfer, 80.5% Average, and 88.7% Last under Order-I, surpassing the prior state of the art by 5.0, 3.7, and 3.0 percentage points with only 2.5M trainable parameters and no external data.
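
The first contribution, text-space task routing, can be illustrated with a minimal sketch. It assumes that image and text embeddings are already available as plain vectors from a frozen encoder. The function names (`cosine`, `route_task`), the choice to score a task by its best-matching class prototype, and the toy 3-dimensional vectors are hypothetical illustrations, not details taken from the paper.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def route_task(
    image_feature: Sequence[float],
    text_prototypes: Dict[str, List[Sequence[float]]],
) -> str:
    """Pick the task whose text prototypes best match the image feature.

    Assumption: each task is scored by the maximum cosine similarity over
    its class-level text prototypes. The prototypes are fixed vectors, so
    routing adds no trainable parameters and does not depend on the order
    in which tasks were learned.
    """
    if not text_prototypes:
        raise ValueError("at least one task is required")
    scores = {
        task: max(cosine(image_feature, proto) for proto in protos)
        for task, protos in text_prototypes.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy stand-ins for frozen text embeddings (hypothetical values).
    prototypes = {
        "aircraft": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
        "flowers": [[0.0, 1.0, 0.0], [0.0, 0.9, 0.2]],
    }
    print(route_task([0.8, 0.2, 0.1], prototypes))
```

Because each task's score is computed from its own fixed prototypes alone, adding a new task only appends an entry to the dictionary and leaves the scores of earlier tasks unchanged, which is the sense in which this kind of routing is order-independent.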
