DOT-MoE: Differentiable Optimal Transport for MoEfication

2026-06-01 • Machine Learning

Machine Learning, Artificial Intelligence

AI summary

The authors address the inference cost of very large language models by converting dense feed-forward layers into a set of smaller experts, only some of which are active for any given token. They propose DOT-MoE, which uses Differentiable Optimal Transport to assign neurons to experts under equal capacity constraints, rather than relying on random splitting or heuristic clustering. The neuron-to-expert assignment and the token-to-expert routing are learned jointly, end-to-end, during training. In their experiments, this approach retains about 90% of the original dense model's performance while using only 50% of the active parameters.

Large Language Models, Mixture of Experts, Inference Efficiency, Feed-Forward Network, Differentiable Optimal Transport, Sinkhorn-Knopp Iterations, Straight-Through Estimator, Neuron Assignment, Token Routing, Structured Pruning

Authors

Udbhav Bamba, Arnav Chavan, Aryamaan Thakur, Steve Teig, Deepak Gupta

Abstract

The scaling of Large Language Models (LLMs) has driven significant performance gains but created substantial challenges in inference efficiency. While Mixture-of-Experts (MoE) architectures address this by decoupling model size from inference cost, training MoEs from scratch is often unstable and compute-intensive. Conversion of pre-trained dense models into sparse MoEs has emerged as an alternative; however, existing methods typically rely on heuristic neuron clustering or random splitting to partition the Feed-Forward Network (FFN) into experts. In this work, we propose DOT-MoE, a novel framework that formulates the decomposition of dense layers as a Differentiable Optimal Transport (DOT) problem. Instead of static heuristics, we model neuron assignment as a balanced transport problem, utilizing differentiable Sinkhorn-Knopp iterations to enforce strict expert capacity constraints. Furthermore, we utilize Straight-Through Estimators (STE) to jointly learn the discrete neuron-to-expert assignment and the token-to-expert routing policy end-to-end. Extensive experiments across multiple architectures and benchmarks demonstrate that DOT-MoE significantly outperforms structured pruning, heuristic clustering, and random-split baselines, retaining 90% of the original dense model's performance while reducing active parameters by 50%.
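
To make the balanced-transport idea concrete, the following is a minimal NumPy sketch of Sinkhorn-Knopp iterations that spread N neurons over E experts of equal capacity. It is an illustration only, not the authors' implementation: the function names `sinkhorn_assign` and `round_to_balanced`, the score matrix, the temperature `tau`, and the greedy rounding step are all assumptions introduced here, and the abstract does not say how DOT-MoE discretizes the plan.

```python
import numpy as np


def _logsumexp(x, axis):
    # Numerically stable log(sum(exp(x))) along one axis, keeping dimensions.
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))


def sinkhorn_assign(scores, n_iters=200, tau=1.0):
    """Soft, balanced assignment of N neurons to E experts.

    scores: (N, E) array of neuron-to-expert affinities (hypothetical inputs).
    Returns an (N, E) plan whose columns sum to N / E and whose rows
    approach a sum of 1 as the iterations converge.
    """
    n, e = scores.shape
    log_plan = scores / tau
    log_capacity = np.log(n / e)  # each expert should receive N / E units of mass
    for _ in range(n_iters):
        # Row step: every neuron distributes exactly one unit of mass.
        log_plan = log_plan - _logsumexp(log_plan, axis=1)
        # Column step: every expert receives exactly N / E units of mass.
        log_plan = log_plan - _logsumexp(log_plan, axis=0) + log_capacity
    return np.exp(log_plan)


def round_to_balanced(plan):
    """Greedy rounding of a soft plan to a hard, capacity-respecting partition.

    This rounding rule is an assumption made for illustration.
    """
    n, e = plan.shape
    assert n % e == 0, "this sketch assumes N is divisible by E"
    capacity = n // e
    counts = np.zeros(e, dtype=int)
    assignment = np.full(n, -1)
    # Visit neurons from most to least confident.
    for i in np.argsort(-plan.max(axis=1)):
        # Try this neuron's experts in order of preference.
        for j in np.argsort(-plan[i]):
            if counts[j] < capacity:
                assignment[i] = j
                counts[j] += 1
                break
    return assignment


rng = np.random.default_rng(0)
plan = sinkhorn_assign(rng.random((8, 4)))
experts = round_to_balanced(plan)
print(np.bincount(experts, minlength=4))  # neurons per expert
```

NumPy has no automatic differentiation, so the sketch stops at the forward computation. In an autodiff framework, a common straight-through pattern is to use the hard one-hot assignment in the forward pass while letting gradients flow through the soft plan, for example by computing `soft + stop_gradient(hard - soft)`.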
