SPRI: SVD-Partitioned Residual Initialization for Data-Constrained MoE Upcycling
2026-06-15 • Machine Learning
Machine Learning, Artificial Intelligence
AI summary
The authors study upcycling, which converts an already trained dense model into a Mixture-of-Experts (MoE) model so that the MoE does not have to be trained from scratch. They observe that existing upcycling methods perform poorly when supervised training data is limited, either because the experts start out too similar or because the initialization perturbs the pretrained parameters too much. To address this, they propose SPRI, which distributes SVD-partitioned residuals of the pretrained feed-forward network (FFN) weights across the routed experts, preserving pretrained structure while giving each expert distinct traits. On multilingual speech-to-text translation, SPRI improves translation quality over both fully fine-tuned dense models and prior MoE upcycling methods. The approach also uses a two-stage training strategy to improve adaptation stability.
Mixture-of-Experts (MoE), Model upcycling, Sparse models, Singular Value Decomposition (SVD), Feed-Forward Network (FFN), Speech-to-text translation, Multilingual adaptation, BLEU score, COMET score, Pretrained weights
Authors
Weiqiao Shan, Ruixiang Mao, Yuang Li, Yuhao Zhang, Yingfeng Luo, Tong Zheng, Chen Xu, Yucheng Qiao, Chunxiang Jin, Yi Yuan, Jingdong Chen, Tong Xiao, Jingbo Zhu
Abstract
Mixture-of-Experts (MoE) models enable efficient scaling, but training them from scratch remains prohibitively expensive. MoE upcycling mitigates this cost by converting pretrained dense models into sparse MoE models. However, existing upcycling methods typically rely on large-scale continued training and often perform poorly under data-constrained supervised adaptation, due to either homogeneous experts or overly disruptive perturbations to pretrained parameters. In this setting, effective upcycling must leverage pretrained weight structure while introducing sufficient diversity among routed experts. To this end, we propose SVD-Partitioned Residual Initialization (SPRI), which distributes SVD-partitioned residuals derived from pretrained feed-forward network (FFN) weights across routed experts, introducing controlled expert diversity grounded in pretrained spectral structure. We further introduce a two-stage training strategy to improve adaptation stability. We evaluate SPRI on multilingual speech-to-text translation, where limited supervised data challenges MoE upcycling and multiple target languages provide natural routing heterogeneity. On CoVoST2 across 15 En-to-XX directions, SPRI improves average BLEU and COMET over fully fine-tuned dense models by 2.58 and 3.32 points, respectively, and outperforms the prior best MoE upcycling baseline by 3.39 BLEU and 4.34 COMET points.
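The abstract does not spell out how the residuals are formed or distributed, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes that the "residual" is the part of a pretrained FFN weight matrix left over after a rank-`keep_rank` SVD approximation, that this residual is split into contiguous groups of singular components, and that each routed expert starts from the pretrained weight plus a scaled copy of its own group. The function name `spri_like_init` and the parameters `keep_rank` and `scale` are hypothetical.

```python
import numpy as np

def spri_like_init(W, num_experts, keep_rank, scale=0.1):
    """Hypothetical sketch: expert_i = W + scale * (i-th slice of the SVD tail of W).

    Assumption (not from the paper): the residual is W minus its rank-`keep_rank`
    approximation, partitioned into contiguous groups of singular components.
    """
    # Thin SVD: W = U @ diag(S) @ Vt, singular values sorted in descending order.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)

    # Indices of the singular components outside the top-`keep_rank` block.
    tail = np.arange(keep_rank, S.shape[0])

    # One (possibly uneven) group of tail components per routed expert.
    groups = np.array_split(tail, num_experts)

    experts = []
    for idx in groups:
        # Rebuild the residual slice spanned by this expert's components.
        residual = (U[:, idx] * S[idx]) @ Vt[idx, :]
        # Every expert stays close to W but is perturbed along different directions.
        experts.append(W + scale * residual)
    return experts

# Toy usage with a random stand-in for a pretrained FFN weight matrix.
rng = np.random.default_rng(0)
W_pretrained = rng.standard_normal((8, 6))
expert_weights = spri_like_init(W_pretrained, num_experts=3, keep_rank=2, scale=0.5)
```

Under these assumptions each expert differs from the pretrained weight only along directions that already exist in its spectrum, which is one way to obtain expert diversity without random perturbations. The grouping scheme and the scaling factor are free design choices in this sketch.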