UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing

2026-06-02 • Distributed, Parallel, and Cluster Computing

Distributed, Parallel, and Cluster Computing, Machine Learning

AI summary

The authors address device-level expert load imbalance in large Mixture of Experts (MoE) models during training and serving, which creates compute stragglers, token all-to-all bottlenecks, and activation-memory spikes. They introduce UltraEP, a system that rebalances load in real time for every microbatch and layer using the exact post-gating load, targeting training and serving prefill on rack-scale nodes (RSNs). UltraEP pairs quota-driven planning with RSN-native persistent tile streaming and relay-based fan-out mitigation to minimize exposed overhead. Averaged across MoE models from 106B to 671B parameters, it reaches 94.3% of the force-balanced ideal throughput, a 1.49× improvement over no balancing, and reduces the final inter-rank imbalance from 1.30–4.01 to 1.01–1.04. The authors also validate its scalability and robustness in production MoE training with 2560 GPUs.

Mixture of Experts (MoE), Expert Parallelism, Load Balancing, Rack-Scale Node (RSN), Microbatch, Token All-to-All, Activation Memory, Throughput, GPU Scale, Post-Gating Load

Authors

Xinming Wei, Chao Jin, Tuo Dai, Yinmin Zhong, Shan Yu, Chengxu Yang, Bingyang Wu, Zili Zhang, Jing Mai, Qianchao Zhu, Zhouyang Li, Yuliang Liu, Guojie Luo

Abstract

Large-scale expert parallelism (EP) is becoming pivotal for training and serving frontier MoE models, but it also amplifies device-level expert load imbalance into compute stragglers, token all-to-all bottlenecks, and activation-memory spikes. Existing balancers redistribute experts periodically based on historical load, which becomes unreliable for production deployments with non-stationary load patterns. We present UltraEP, the first exact-load, real-time balancer for large-EP MoE training and serving prefill on rack-scale nodes (RSNs). Built upon the extended scale-up connectivity of RSNs, UltraEP rebalances every microbatch and layer on critical paths, which requires nontrivial co-design of plan solving and expert replication communication to minimize exposed overhead. To this end, UltraEP eagerly reacts to post-gating load with efficient quota-driven planning, and executes the resulting irregular expert-state transfers with RSN-native persistent tile streaming and relay-based fan-out mitigation. Averaged across MoE models from 106B to 671B parameters in training and prefill, UltraEP achieves 94.3% of the force-balanced ideal throughput, delivering 1.49× improvement over non-balancing, while reducing the final inter-rank imbalance from 1.30–4.01 to 1.01–1.04. Additionally, we validate UltraEP's scalability and robustness in production MoE training with 2560 GPUs.
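
To make the imbalance figures and the idea of quota-driven planning concrete, the following is a minimal illustrative sketch, not UltraEP's actual algorithm. It assumes that inter-rank imbalance means the maximum rank load divided by the mean rank load (the abstract does not define the metric), and it uses a simple greedy planner in which each rank's quota is the ceiling of the mean load and overloaded ranks hand excess tokens to replicas of their experts on ranks with spare quota. All function names, the placement layout, and the token counts are hypothetical.

```python
import math


def rank_loads(expert_tokens, placement, num_ranks):
    """Sum post-gating token counts per rank for an expert -> rank placement."""
    loads = [0] * num_ranks
    for expert, tokens in expert_tokens.items():
        loads[placement[expert]] += tokens
    return loads


def imbalance(loads):
    """Assumed metric: max rank load / mean rank load (1.0 is perfectly balanced)."""
    return max(loads) / (sum(loads) / len(loads))


def plan_replicas(expert_tokens, placement, num_ranks):
    """Greedy quota-based plan (illustrative only).

    Returns (moves, new_loads), where each move is
    (expert, src_rank, dst_rank, tokens): dst_rank receives a replica of
    the expert and processes that many of its tokens.
    """
    loads = rank_loads(expert_tokens, placement, num_ranks)
    quota = math.ceil(sum(loads) / num_ranks)
    # Remaining capacity of every rank that is below its quota.
    spare = {r: quota - loads[r] for r in range(num_ranks) if loads[r] < quota}
    new_loads = list(loads)
    moves = []
    # Visit the hottest experts first so that few replicas absorb most excess.
    for expert, tokens in sorted(expert_tokens.items(), key=lambda kv: -kv[1]):
        src = placement[expert]
        movable = min(new_loads[src] - quota, tokens)
        while movable > 0 and spare:
            dst = max(spare, key=spare.get)  # rank with the most spare quota
            amount = min(movable, spare[dst])
            moves.append((expert, src, dst, amount))
            new_loads[src] -= amount
            new_loads[dst] += amount
            movable -= amount
            spare[dst] -= amount
            if spare[dst] == 0:
                del spare[dst]
    return moves, new_loads


# Hypothetical example: 8 experts on 4 ranks, with expert 0 heavily loaded.
expert_tokens = {0: 900, 1: 100, 2: 50, 3: 50, 4: 200, 5: 100, 6: 100, 7: 100}
placement = {e: e // 2 for e in expert_tokens}  # two experts per rank

before = rank_loads(expert_tokens, placement, 4)   # [1000, 100, 300, 200]
moves, after = plan_replicas(expert_tokens, placement, 4)
print(imbalance(before), imbalance(after))         # 2.5 1.0
```

In this toy case the planner replicates expert 0 onto the three underloaded ranks and brings the imbalance from 2.5 to 1.0. A real balancer must also account for the cost of transferring expert state, which is the part the abstract addresses with persistent tile streaming and relay-based fan-out mitigation and which this sketch ignores.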
