DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing

2026-04-09 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition • Artificial Intelligence • Machine Learning

AI summary

The authors address the challenge of updating knowledge in Vision Language Models (VLMs) without disrupting what the model already knows. They propose a method called Dynamic Subspace Concept Alignment (DSCA) that assigns concepts to separate, orthogonal subspaces of the model’s representation space, so updates to one concept do not interfere with others. As a result, edits stay precise, the model stays accurate, and vision and language representations stay aligned. Their experiments show that DSCA keeps edit success high even after many sequential updates and reduces hallucination.

Vision Language Models • Model Editing • Continual Learning • Catastrophic Forgetting • Representation Space • Orthogonal Subspaces • PCA (Principal Component Analysis) • Cross Modal Alignment • Backward Transfer • Knowledge Retention

Authors

Gyanendra Das, Sai Satyam Jena

Abstract

Model editing aims to update a model's knowledge, adding new concepts and changing relevant information, without retraining. Lifelong editing is a challenging task that is prone to disrupting previously learned concepts, especially for Vision Language Models (VLMs), because sequential edits can lead to degraded reasoning and cross-modal misalignment. Existing VLM knowledge editing methods based on gated adapters, activation edits, and parameter merging address the catastrophic forgetting seen in full fine-tuning; however, they still operate in the shared representation space of the VLM, where concepts are entangled, so edits interfere with unrelated concepts. We hypothesize that this instability persists because current methods control edits algorithmically, via optimization, rather than structurally separating knowledge. We introduce Dynamic Subspace Concept Alignment (DSCA), which mitigates this limitation by design: it decomposes the representation space into a set of orthogonal semantic subspaces and proposes edits only in those transformed spaces. These subspaces are obtained through incremental clustering and PCA on joint vision-language representations. This process structurally isolates concepts, enabling precise, non-interfering edits by turning isolation from a soft training objective into an architectural property. The surgical edits are guided by a multi-term loss function that maintains task fidelity, edit locality, and cross-modal alignment. With the base model frozen, our method achieves 98 percent single-edit success, remains over 95 percent after 1000 sequential edits, lowers hallucination by 3 to 5 percent, and achieves the best backward transfer (BWT) scores on continual instruction tuning benchmarks. Extensive experiments demonstrate DSCA's state-of-the-art stability and knowledge retention in continual lifelong editing across various datasets and benchmarks.
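The idea of restricting edits to orthogonal, PCA-derived subspaces can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify how orthogonality between subspaces is enforced, so the sketch assumes a simple sequential scheme that removes directions already claimed by earlier concepts before running PCA. It also assumes cluster assignments are already given (the incremental clustering step is omitted), and the names `build_orthogonal_subspaces` and `confine_edit`, the embedding dimension, and the subspace size `k` are all hypothetical.

```python
import numpy as np


def build_orthogonal_subspaces(clusters, k):
    """Return one (d, k) orthonormal basis per cluster.

    Assumption for illustration: each new basis is computed by PCA on the
    cluster's embeddings after projecting out every direction used by
    earlier clusters, which makes the subspaces mutually orthogonal.
    """
    d = clusters[0].shape[1]
    used = np.zeros((d, 0))  # directions claimed so far
    bases = []
    for points in clusters:
        centered = points - points.mean(axis=0)
        # Strip components lying in previously assigned subspaces.
        residual = centered - centered @ used @ used.T
        # PCA via SVD: rows of vt are principal directions.
        _, _, vt = np.linalg.svd(residual, full_matrices=False)
        basis = vt[:k].T
        bases.append(basis)
        used = np.hstack([used, basis])
    return bases


def confine_edit(delta, basis):
    """Project a proposed update onto a single concept subspace."""
    return basis @ (basis.T @ delta)


# Synthetic stand-ins for joint vision-language embeddings of two concepts.
rng = np.random.default_rng(0)
clusters = [rng.normal(size=(50, 16)) + 3.0 * i for i in range(2)]
bases = build_orthogonal_subspaces(clusters, k=3)

# A raw update touches every direction; the confined one stays in subspace 0.
delta = rng.normal(size=16)
edit = confine_edit(delta, bases[0])

# Largest coordinate of the confined edit inside the other concept's subspace.
leak = np.abs(bases[1].T @ edit).max()
print("leak into concept 1 subspace:", leak)
```

Under these assumptions the confined edit has no component, up to floating-point error, in the second concept's subspace. That is the sense in which isolation becomes a structural property rather than something a loss term has to encourage.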
