ProbMoE: Differentiable Probabilistic Routing for Mixture-of-Experts

2026-06-01 | Machine Learning

Machine Learning, Artificial Intelligence
AI summary

The authors study Mixture-of-Experts (MoE) models that choose a few specialized parts (experts) to handle each piece of data. They note that picking these experts is tricky because it involves making discrete choices that are hard to train with usual gradient methods. To solve this, the authors propose ProbMoE, a method that treats expert selection as a probability problem, allowing gradients to flow more smoothly during training. They show that their approach improves how experts are used and can adjust the number of experts per data token without losing performance.

Mixture-of-Experts, Top-k routing, Gradient estimation, Probabilistic inference, Discrete optimization, Dynamic routing, Expert utilization, Routing diversity, Neural network training
Authors
Heng Zhao, Zilei Shao, Guy Van den Broeck, Zhe Zeng
Abstract
Mixture-of-Experts (MoE) models scale by activating only a small subset of experts per token. However, training such models remains challenging because top-$k$ routing is discrete and non-differentiable, so expert selection requires gradient estimators, whose design remains a central open problem. We introduce ProbMoE, a probabilistic routing framework that models expert selection as a distribution over cardinality-constrained expert subsets and formulates routing as probabilistic inference in this discrete subset space. We first propose ProbMoE Exact-$k$ routing, which samples $k$-expert subsets in the forward pass and, in the backward pass, uses gradients through each expert's exact marginal probability as a tractable surrogate for the true gradient. ProbMoE naturally generalizes to a dynamic-$k$ routing setting, where both training and inference constrain the routing cardinality to the same predefined range, allowing adaptive expert allocation per token. Across benchmarks and model backbones, ProbMoE Exact-$k$ achieves strong performance compared to competitive baselines, with improved expert utilization and routing diversity; ProbMoE Dynamic-$k$ achieves comparable performance with fewer activated experts.
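To make the phrase "exact marginal probability" concrete, the following is a minimal sketch, not the authors' implementation. It assumes a product-form subset distribution, $p(S) \propto \prod_{i \in S} \exp(\text{logit}_i)$ over subsets with $|S| = k$, which the abstract does not specify. Under that assumption, the marginal of expert $i$ is $w_i \, e_{k-1}(w_{-i}) / e_k(w)$, where $e_j$ is the $j$-th elementary symmetric polynomial. The function names `elementary_symmetric`, `exact_k_marginals`, and `sample_k_subset` are hypothetical.

```python
import math
import random


def elementary_symmetric(weights, k):
    """Return [e_0, ..., e_k], the elementary symmetric polynomials of weights."""
    e = [1.0] + [0.0] * k
    for w in weights:
        # Update from high to low degree so each weight is counted at most once.
        for j in range(k, 0, -1):
            e[j] += w * e[j - 1]
    return e


def exact_k_marginals(logits, k):
    """P(expert i is in S) under the ASSUMED p(S) ∝ prod_{i in S} exp(logit_i), |S| = k."""
    m = max(logits)
    # Shifting logits rescales every k-subset weight equally, so p(S) is unchanged.
    weights = [math.exp(l - m) for l in logits]
    z = elementary_symmetric(weights, k)[k]
    marginals = []
    for i, w in enumerate(weights):
        rest = weights[:i] + weights[i + 1:]
        marginals.append(w * elementary_symmetric(rest, k - 1)[k - 1] / z)
    return marginals


def sample_k_subset(logits, k, rng=random):
    """Draw one k-subset from the same assumed distribution, expert by expert."""
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    n = len(weights)
    chosen, need = [], k
    for i in range(n):
        if need == 0:
            break
        if n - i == need:
            # Every remaining expert must be taken to reach exactly k.
            chosen.extend(range(i, n))
            break
        tail = elementary_symmetric(weights[i + 1:], need)
        # e_need(w[i:]) = e_need(tail) + w_i * e_{need-1}(tail)
        include = weights[i] * tail[need - 1]
        if rng.random() < include / (include + tail[need]):
            chosen.append(i)
            need -= 1
    return chosen


logits = [0.5, -1.0, 2.0, 0.0, 1.0]
marginals = exact_k_marginals(logits, k=2)
subset = sample_k_subset(logits, k=2, rng=random.Random(0))
```

In this sketch the marginals sum to $k$, since every sampled subset contains exactly $k$ experts. In an autodiff framework, the forward pass would use the sampled `subset` while gradients would flow through `marginals` with respect to the logits; that wiring is omitted here.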