RotMoLE: Enhancing Mixture of Low-Rank Experts through Rotational Gating Mechanism

2026-05-25 • Machine Learning

Machine Learning • Computation and Language

AI summary

The authors address the challenge of adapting large language models (LLMs) to complex tasks that draw on diverse specialized knowledge. They focus on Mixture-of-Experts (MoE) applied to low-rank adapters (MoE-LoRA), in which a gate selects among several expert components to handle diverse data. They propose RotMoLE, which extends the gate with a rotation step so that each selected expert is rotated as well as scaled, helping experts specialize. Their tests on complex multi-task and multilingual scenarios indicate that the rotation mechanism helps the model learn more effectively, especially when the number of experts is limited.

Large Language Models • Mixture-of-Experts • Parameter-Efficient Fine-Tuning • MoE-LoRA • Low-rank adapters • Gating mechanisms • Rotation gate • Multilingual training • Multi-task learning

Authors

Mengyang Sun, Maochuan Dou, Tao Feng, Dan Zhang, Yihao Wang, Junpeng Liu, Yifan Zhu, Jie Tang

Abstract

While Large Language Models (LLMs) are commonly fine-tuned to handle domain-specific tasks before being applied to vertical applications, adapting them to complex scenarios with diverse specialized knowledge remains challenging. Meanwhile, the Mixture-of-Experts (MoE) architecture has emerged as a crucial paradigm for training LLMs, and some recent works have incorporated MoE into Parameter-Efficient Fine-Tuning (PEFT), yielding the Mixture of Low-rank Experts (MoE-LoRA), which enhances the power of low-rank adapters for learning complicated knowledge. However, conventional gating mechanisms in MoE typically apply only a scalar reweighting to the selected experts, thereby limiting their capacity for representation and generalization. Motivated and enabled by the low-rank structures in MoE-LoRA, we propose RotMoLE, a specialized MoE framework for low-rank experts featuring an additional rotation gate. Beyond simple scaling, RotMoLE implements a rotation mechanism for each selected expert, enabling superior expert exploitation and specialization for learning diverse data, especially when expert candidates are limited. Empirical results on complex multi-task and multilingual training scenarios validate the effectiveness of our approach.
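The contrast between scalar reweighting and a rotation gate can be illustrated with a minimal NumPy sketch. The abstract does not say how the rotation is parameterized, so everything specific here is an assumption made for illustration: the rotation acts in the rank-r latent space between the LoRA down- and up-projections, it is built from input-dependent Givens rotations on adjacent coordinate pairs, and the names `rotation_from_angles`, `moe_lora_delta`, `W_gate`, and `W_rot` are hypothetical. It is not the authors' implementation.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def rotation_from_angles(r, angles):
    """Build an r x r rotation matrix from Givens rotations on the
    coordinate pairs (0, 1), (2, 3), ... (assumed parameterization).
    If r is odd, the last coordinate is left unrotated."""
    R = np.eye(r)
    for k, theta in enumerate(angles):
        i, j = 2 * k, 2 * k + 1
        c, s = np.cos(theta), np.sin(theta)
        R[i, i], R[i, j] = c, -s
        R[j, i], R[j, j] = s, c
    return R


def moe_lora_delta(x, A, B, W_gate, top_k, W_rot=None):
    """Adapter output for one token x of shape (d_in,).

    A: (E, r, d_in) down-projections, B: (E, d_out, r) up-projections,
    W_gate: (E, d_in) router weights.
    W_rot: optional (E, r // 2, d_in) weights of the hypothetical rotation
    gate; with W_rot=None this reduces to scalar-gated MoE-LoRA.
    """
    logits = W_gate @ x
    top = np.argsort(logits)[-top_k:]      # indices of the top-k experts
    weights = softmax(logits[top])         # scalar reweighting of those experts
    out = np.zeros(B.shape[1])
    for w, e in zip(weights, top):
        h = A[e] @ x                       # rank-r latent code of expert e
        if W_rot is not None:
            angles = W_rot[e] @ x          # input-dependent rotation angles
            h = rotation_from_angles(h.shape[0], angles) @ h
        out += w * (B[e] @ h)              # scale, project up, accumulate
    return out


# Toy dimensions: 4 experts of rank 4, 8 input and 6 output features.
rng = np.random.default_rng(0)
E, r, d_in, d_out = 4, 4, 8, 6
x = rng.normal(size=d_in)
A = rng.normal(size=(E, r, d_in))
B = rng.normal(size=(E, d_out, r))
W_gate = rng.normal(size=(E, d_in))
W_rot = rng.normal(size=(E, r // 2, d_in))

plain = moe_lora_delta(x, A, B, W_gate, top_k=2)
rotated = moe_lora_delta(x, A, B, W_gate, top_k=2, W_rot=W_rot)
```

Under these assumptions the rotation costs only an r x r product per selected expert, which is one plausible reading of why the low-rank structure "enables" the mechanism. A scalar gate can only rescale each expert's output, whereas the rotation changes its direction, so a single expert can produce different updates for different inputs.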
