Prism: A Plug-in Reproducible Infrastructure for Scalable Multimodal Continual Instruction Tuning

2026-05-25 · Machine Learning

Machine Learning, Computation and Language, Computer Vision and Pattern Recognition
AI summary

The authors discuss a way to improve how multimodal large language models (MLLMs) are trained to handle new tasks over time, called Multimodal Continual Instruction Tuning (MCIT). They point out that existing approaches require changing the main model code, which makes it hard to reuse work and compare methods fairly. To fix this, the authors created Prism, a flexible codebase that lets researchers add new training methods as plugins without changing the core model. This setup makes it easier and faster to develop and test MCIT strategies while supporting large-scale training consistently.

Multimodal Large Language Models, Instruction Tuning, Continual Learning, Plugin Architecture, Codebase, Algorithm Development, Reproducibility, Scalable Training, Multimodal Learning, Machine Learning Engineering
Authors
Jun-Tao Tang, Yu-Cheng Shi, Zhen-Hao Xie, Da-Wei Zhou
Abstract
Multimodal Large Language Models (MLLMs) achieve versatility by reformulating diverse tasks into a unified instruction-following framework via instruction tuning. However, real-world deployment requires continuous adaptation to emerging tasks, motivating Multimodal Continual Instruction Tuning (MCIT). Despite its growing importance, current MCIT research is hindered by severe engineering bottlenecks. Existing methods are typically implemented by directly modifying the base MLLM codebase, which imposes substantial implementation overhead and yields method-specific architectures that severely limit code reuse and fair comparison. To address this, we introduce Prism, a plug-in reproducible codebase specifically designed for scalable MCIT research. It separates algorithmic development from the backbone implementation via a lightweight plugin registration mechanism, enabling new strategies to be integrated as independent plugins without modifying the underlying MLLM codebase, thereby eliminating structural fragmentation and accelerating method development. Prism natively supports widely used large-scale training pipelines, thereby enabling reproducible and scalable MCIT experimentation. Code is available at https://github.com/LAMDA-CL/Prism.
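To make the idea of a plugin registration mechanism concrete, the following is a minimal sketch in Python of how a registry can decouple a continual-learning strategy from the training code. It is not taken from the Prism repository: every name here (`PLUGIN_REGISTRY`, `register_plugin`, `ContinualPlugin`, `modify_loss`, `train_step`) is hypothetical, and the scalar "loss" and "drift" values stand in for real tensors.

```python
from typing import Callable, Dict, Type

# Hypothetical registry mapping a method name to its plugin class.
# This illustrates the general pattern only; it is NOT Prism's actual API.
PLUGIN_REGISTRY: Dict[str, Type["ContinualPlugin"]] = {}


def register_plugin(name: str) -> Callable[[type], type]:
    """Class decorator that records a plugin class under a string key."""
    def decorator(cls: type) -> type:
        if name in PLUGIN_REGISTRY:
            raise ValueError(f"plugin '{name}' is already registered")
        PLUGIN_REGISTRY[name] = cls
        return cls
    return decorator


class ContinualPlugin:
    """Base class whose hooks are no-ops, so the backbone code never changes."""

    def modify_loss(self, loss: float, drift: float) -> float:
        # Default behavior: leave the task loss untouched.
        return loss


@register_plugin("finetune")
class SequentialFinetune(ContinualPlugin):
    """Baseline: plain sequential fine-tuning, no extra regularization."""


@register_plugin("l2_penalty")
class L2Penalty(ContinualPlugin):
    """Toy strategy: penalize parameter drift from the previous task."""

    def __init__(self, strength: float = 0.1) -> None:
        self.strength = strength

    def modify_loss(self, loss: float, drift: float) -> float:
        return loss + self.strength * drift


def build_plugin(name: str, **kwargs) -> ContinualPlugin:
    """Look up a plugin by name and instantiate it with its own options."""
    return PLUGIN_REGISTRY[name](**kwargs)


def train_step(base_loss: float, drift: float, plugin: ContinualPlugin) -> float:
    """Stand-in for the backbone training step; it only calls plugin hooks."""
    return plugin.modify_loss(base_loss, drift)
```

Under these assumptions, switching methods is a matter of changing the string passed to `build_plugin`, for example `build_plugin("l2_penalty", strength=0.5)`, while `train_step` stays the same for every strategy. That shared, unmodified training path is what allows different methods to be compared on equal footing.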