Attention-Spectrum Regularization for Replay-Free Continual Multimodal LLMs

2026-06-22 • Computer Vision and Pattern Recognition

Computer Vision and Pattern RecognitionArtificial Intelligence

AI summaryⓘ

The authors address a problem where multimodal large language models forget old skills when learning new visual tasks over time. They introduce Attention-Spectrum Regularization (ASR), which keeps track of how the model's attention patterns behave using compact spectral summaries, instead of saving old data or examples. This method helps the model remember useful cross-modal attention structures tied to previous skills while still learning new ones. Experiments show ASR improves performance and reduces forgetting compared to other methods. Their work provides a lightweight way to maintain important attention features during continual learning in multimodal models.

Multimodal large language modelsContinual learningForgettingCross-modal attentionSpectral analysisAttention mapsVisual question answeringRegularizationReplay-free methodsFourier power spectrum

Authors

Chuangxin Zhao, Canran Xiao, Siyuan Ma, Mengyao Lyu, Yanbiao Ma, Jun Xia, Guiguang Ding, Yang Liu

Abstract

Multimodal large language models (MLLMs) are increasingly required to adapt to non-stationary streams of visual domains, question types, and user instructions, yet continual fine-tuning often causes severe forgetting of previously acquired multimodal skills. Existing continual vision-language methods mainly preserve outputs, replay data or pseudo-data, regularize embedding geometry, or allocate task-specific parameters, but they provide limited control over how internal cross-modal attention patterns supporting old skills drift during adaptation. We propose Attention-Spectrum Regularization (ASR), a replay-free continual learning framework that preserves skill-conditioned structures of cross-modal attention. ASR treats cross-attention maps as two-dimensional signals, summarizes their scale and directional properties into compact spectral statistics, and stores only skill-wise prototype distributions instead of replaying past image-question pairs, generated pseudo-examples, or old-stage teacher snapshots. In later stages, a phase-invariant spectral regularizer constrains harmful drift of these prototypes while allowing instance-level attention to adapt to new tasks. We provide theoretical analysis showing that skill-conditioned spectral drift controls forgetting under a spectral sufficiency assumption, and that Fourier power spectra are stable to spatial translations and bounded perturbations. Experiments on continual VQA and multimodal instruction-tuning benchmarks, including VQA v2, VQACL, CLT-VQA, CoIN, and UCIT, show that ASR consistently improves final performance and reduces forgetting over strong replay-, regularization-, and adapter-based baselines. Preserving skill-level attention structure is an effective and lightweight mechanism for continual MLLMs. Code is available at https://github.com/Creative-zcx/attention-spectrum-replay

View PDFOpen arXiv