Beyond Uniform Experts: Cost-Aware Expert Execution for Efficient Multi-Device MoE Inference

2026-06-29
Distributed, Parallel, and Cluster Computing

AI summary

The authors address a problem in Mixture-of-Experts (MoE) language models: some experts consume substantial memory and bandwidth while contributing little to the final output, which slows inference. They identify two challenges: low-importance experts cost nearly as much to run as important ones, and in a multi-device setup the slowest device limits overall speed. To address this, they propose CAEE, a method that estimates the cost of running each expert, skips the expensive, low-value ones, and redistributes their contributions among the remaining experts to preserve accuracy. Tests on the 671B DeepSeek-R1 model show that CAEE reduces end-to-end inference latency by 8-18% with less than a 1% drop in accuracy.

Mixture-of-Experts, sparse activation, inference latency, token-level importance, cost model, expert pruning, data movement bottleneck, multi-device systems
Authors
Hui Zang, Pengfei Xia, Hong Liu, Jiajia Chu, Tuo Hao, Minghao Chen, Rui Zhang, Ziyang Zhang
Abstract
Mixture-of-Experts (MoE) architectures enable language models to achieve unprecedented scale via sparse activation. However, their inference performance is often limited by data movement bottlenecks. Two coupled challenges exacerbate this limitation: (1) Importance-Agnostic Cost: Low-contribution experts incur nearly the same memory and transfer costs as high-contribution ones, resulting in a poor cost-benefit ratio and wasting critical bandwidth; (2) System-Level Imbalance: Multi-device deployments are bottlenecked by the slowest device, meaning that local reductions on one device may yield no improvement in end-to-end latency. We propose Cost-Aware Expert Execution (CAEE), a hardware-guided runtime framework that jointly optimizes for token-level expert importance and system-level execution cost. CAEE uses lightweight, calibrated cost models to estimate hardware overhead, selectively prunes low-importance, high-cost experts, and redistributes their contributions via a low-overhead compensation mechanism, avoiding extra data movement. Evaluations on the 671B DeepSeek-R1 model show that CAEE can reduce end-to-end inference latency by 8%-18% across diverse deployment settings, including expert offloading and on-device execution on multi-device systems, while keeping the model accuracy drop below 1%.
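The abstract does not specify CAEE's algorithm, so the following is only an illustrative sketch of the general idea it describes: estimate a cost for each expert call, prune low-importance calls on whichever device is currently the bottleneck, and rescale the surviving gate weights to compensate. Everything here is an assumption made for illustration and is not taken from the paper: the `ExpertCall` record, the `est_cost_ms` field standing in for a calibrated cost model, the importance threshold, the importance-per-cost pruning rule, and simple renormalization as the compensation step.

```python
from dataclasses import dataclass


@dataclass
class ExpertCall:
    expert_id: int
    device: int
    gate_weight: float   # router weight for this token (importance proxy), > 0
    est_cost_ms: float   # hypothetical cost-model estimate (transfer + compute), > 0


def device_loads(calls):
    """Sum the estimated cost of the expert calls placed on each device."""
    loads = {}
    for c in calls:
        loads[c.device] = loads.get(c.device, 0.0) + c.est_cost_ms
    return loads


def prune_and_compensate(calls, importance_threshold=0.1, min_experts=1):
    """Drop low-importance calls on the slowest device, then rescale the rest.

    Assumed rule (not the paper's): a call is prunable only if its share of the
    total gate weight is below `importance_threshold` AND it sits on the device
    with the largest estimated load, since only that device bounds latency.
    """
    kept = list(calls)
    total_weight = sum(c.gate_weight for c in calls)
    while len(kept) > min_experts:
        loads = device_loads(kept)
        slowest = max(loads, key=loads.get)
        candidates = [
            c for c in kept
            if c.device == slowest
            and c.gate_weight / total_weight < importance_threshold
        ]
        if not candidates:
            break  # pruning elsewhere would not shorten the critical path
        # Remove the call with the least importance per unit of estimated cost.
        victim = min(candidates, key=lambda c: c.gate_weight / c.est_cost_ms)
        kept.remove(victim)
    # Compensation (assumed): renormalize so the kept weights sum to the
    # original total; this touches only scalars, so no expert data moves.
    scale = total_weight / sum(c.gate_weight for c in kept)
    return [
        ExpertCall(c.expert_id, c.device, c.gate_weight * scale, c.est_cost_ms)
        for c in kept
    ]


if __name__ == "__main__":
    calls = [
        ExpertCall(0, device=0, gate_weight=0.50, est_cost_ms=1.0),
        ExpertCall(1, device=0, gate_weight=0.30, est_cost_ms=1.0),
        ExpertCall(2, device=1, gate_weight=0.15, est_cost_ms=1.0),
        ExpertCall(3, device=1, gate_weight=0.05, est_cost_ms=4.0),
    ]
    result = prune_and_compensate(calls)
    print("kept experts:", [c.expert_id for c in result])
    print("bottleneck before:", max(device_loads(calls).values()))
    print("bottleneck after:", max(device_loads(result).values()))
```

In this toy setup the only call that is both cheap in importance and expensive in cost sits on the slower device, so removing it shortens the critical path. A low-importance call on a device that is not the bottleneck would be left alone, which mirrors the abstract's point that local savings on one device need not improve end-to-end latency.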