Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
2026-04-09 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition • Computation and Language • Machine Learning
AI summary
The authors study how to add image generation to large multimodal models that handle both images and text without degrading understanding. They find that existing methods such as Mixture-of-Transformers avoid gradient conflict by structurally isolating the tasks, which cuts off synergy between them and fragments model capacity. Their solution, Symbiotic-MoE, partitions experts into task-specific groups while keeping shared experts as a bridge between modalities, so knowledge from image generation can enrich text representations. A progressive training strategy with differential learning rates and early-stage gradient shielding protects pre-trained knowledge from early training volatility and lets generative signals gradually become useful feedback for understanding. Their experiments report fast generative convergence together with gains on understanding benchmarks such as MMLU and OCRBench.
Large Multimodal Models • Catastrophic Forgetting • Mixture-of-Experts (MoE) • Gradient Conflict • Modality-Aware Expert Disentanglement • Cross-modal Synergy • Progressive Training Strategy • Generative Tasks • Textual Representation • MMLU and OCRBench
Authors
Xiangyue Liu, Zijian Zhang, Miles Yang, Zhao Zhong, Liefeng Bo, Ping Tan
Abstract
Empowering Large Multimodal Models (LMMs) with image generation often leads to catastrophic forgetting in understanding tasks due to severe gradient conflicts. While existing paradigms like Mixture-of-Transformers (MoT) mitigate this conflict through structural isolation, they fundamentally sever cross-modal synergy and suffer from capacity fragmentation. In this work, we present Symbiotic-MoE, a unified pre-training framework that resolves task interference within a native multimodal Mixture-of-Experts (MoE) Transformer architecture with zero-parameter overhead. We first identify that standard MoE tuning leads to routing collapse, where generative gradients dominate expert utilization. To address this, we introduce Modality-Aware Expert Disentanglement, which partitions experts into task-specific groups while utilizing shared experts as a multimodal semantic bridge. Crucially, this design allows shared experts to absorb fine-grained visual semantics from generative tasks to enrich textual representations. To optimize this, we propose a Progressive Training Strategy featuring differential learning rates and early-stage gradient shielding. This mechanism not only shields pre-trained knowledge from early volatility but eventually transforms generative signals into constructive feedback for understanding. Extensive experiments demonstrate that Symbiotic-MoE achieves rapid generative convergence while unlocking cross-modal synergy, boosting inherent understanding with remarkable gains on MMLU and OCRBench.
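The two ideas named in the abstract, task-partitioned experts with a shared bridge and early-stage gradient shielding, can be illustrated with a minimal sketch. This is not the authors' implementation: the expert layout, the always-on treatment of shared experts, the `route` and `generative_grad_scale` names, and the linear ramp after the shielding phase are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math

# Hypothetical expert layout for one MoE layer (assumption, not from the paper).
TASK_EXPERTS = {
    "understanding": [0, 1, 2],
    "generation": [3, 4, 5],
}
SHARED_EXPERTS = [6, 7]  # assumed to be active for every token


def route(logits, task, top_k=2):
    """Return {expert_index: weight} for a single token.

    Only experts in the token's task group compete in top-k routing, so
    the other task's experts can never be selected. Shared experts are
    added unconditionally with weight 1.0.
    """
    group = TASK_EXPERTS[task]  # raises KeyError for an unknown task
    chosen = sorted(group, key=lambda e: logits[e], reverse=True)[:top_k]
    # Softmax over the selected logits, shifted by the max for stability.
    peak = max(logits[e] for e in chosen)
    exps = {e: math.exp(logits[e] - peak) for e in chosen}
    total = sum(exps.values())
    weights = {e: v / total for e, v in exps.items()}
    for e in SHARED_EXPERTS:
        weights[e] = 1.0
    return weights


def generative_grad_scale(step, shield_steps=1000, ramp_steps=1000):
    """Scale applied to generative gradients that reach shared parameters.

    Zero during the shielding phase, then a linear ramp up to 1.0.
    The schedule shape and step counts are illustrative assumptions.
    """
    if step < shield_steps:
        return 0.0
    return min(1.0, (step - shield_steps) / ramp_steps)


if __name__ == "__main__":
    router_logits = [0.1, 2.0, 1.0, 5.0, 4.0, 3.0, 0.0, 0.0]
    print(route(router_logits, "understanding"))
    print(generative_grad_scale(1500))
```

In this sketch an understanding token is routed only among experts 0 to 2 even though the generation experts have larger logits, which is one simple way to keep generative gradients from dominating expert utilization. In a real training loop the value from `generative_grad_scale` would multiply the generation-loss gradient before it updates shared parameters, while differential learning rates would typically be set through separate optimizer parameter groups.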