FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving

2026-04-29

Subjects: Distributed, Parallel, and Cluster Computing; Machine Learning
AI summary

The authors address the challenge of serving large Mixture-of-Experts (MoE) models, which usually require keeping all experts in memory even though only a few are activated per input. They propose FaaSMoE, which uses cloud functions (Function-as-a-Service) to run only the needed experts on demand, saving resources especially when multiple tenants share the system. Their approach separates control from execution and allows tuning how experts are grouped into functions to balance invocation speed against elasticity. In their evaluation, the system uses far less memory than keeping all experts resident at all times. This work demonstrates a new, more resource-efficient way to serve large MoE models in shared environments.

Keywords: Mixture-of-Experts, Function-as-a-Service, multi-tenant, model serving, stateless functions, scalability, resource efficiency, edge computing
Authors
Minghe Wang, Trever Schirmer, Mohammadreza Malekabbasi, David Bermbach
Abstract
Mixture-of-Experts (MoE) models offer high capacity at efficient inference cost by activating only a small subset of experts per input. However, deploying MoE models conventionally requires all experts to reside in memory, creating a gap between the resources consumed by activated experts and the resources provisioned. This underutilization is further pronounced in multi-tenant scenarios. In this paper, we propose FaaSMoE, a multi-tenant MoE serving architecture built on Function-as-a-Service (FaaS) platforms. FaaSMoE decouples the control and execution planes of MoE by deploying experts as stateless FaaS functions, enabling on-demand and scale-to-zero expert invocation across tenants. FaaSMoE further supports configurable expert granularity within functions, trading off per-expert elasticity against invocation overhead. We implement a prototype on an open-source edge-oriented FaaS platform and evaluate it using Qwen1.5-moe-2.7B under multi-tenant workloads. Compared to a full-model baseline, FaaSMoE uses less than one third of the resources, demonstrating a practical and resource-efficient path towards scalable MoE serving in multi-tenant environments.
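To make the control/execution split concrete, the following is a minimal, self-contained sketch of the idea the abstract describes: a control plane routes each token to its top-k experts and invokes a stateless per-expert "function" on demand, instantiating it only on first use (a simulated cold start) and releasing all instances to scale to zero. All class and method names here are illustrative assumptions, not the authors' implementation; a real deployment would invoke remote FaaS functions rather than in-process objects.

```python
import math
import random


def matvec(w, x):
    """Plain-Python matrix-vector product."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


class ExpertFunction:
    """Stateless 'FaaS function' wrapping one expert. Constructing it stands in
    for a cold start that pulls the expert's weights from shared storage."""

    def __init__(self, expert_id, d_model):
        rng = random.Random(expert_id)  # deterministic toy weights per expert
        self.w = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_model)]

    def __call__(self, x):
        # Toy expert: a single linear layer with ReLU.
        return [max(v, 0.0) for v in matvec(self.w, x)]


class ControlPlane:
    """Toy control plane: gates each token to top-k experts, invokes the
    matching expert functions on demand, and can scale idle ones to zero."""

    def __init__(self, n_experts=8, top_k=2, d_model=4):
        rng = random.Random(42)
        self.n_experts, self.top_k, self.d_model = n_experts, top_k, d_model
        # Toy router (gating) weights: d_model x n_experts.
        self.router = [[rng.gauss(0, 1) for _ in range(n_experts)]
                       for _ in range(d_model)]
        self.warm = {}  # expert_id -> warm ExpertFunction instance

    def _invoke(self, eid, x):
        if eid not in self.warm:  # cold start only when the expert is first needed
            self.warm[eid] = ExpertFunction(eid, self.d_model)
        return self.warm[eid](x)

    def forward(self, token):
        # Gating scores for every expert, then pick the top-k.
        logits = [sum(token[j] * self.router[j][e] for j in range(self.d_model))
                  for e in range(self.n_experts)]
        top = sorted(range(self.n_experts), key=lambda e: logits[e])[-self.top_k:]
        gates = [math.exp(logits[e]) for e in top]
        z = sum(gates)
        # Weighted sum of the activated experts' outputs only.
        out = [0.0] * self.d_model
        for g, e in zip(gates, top):
            y = self._invoke(e, token)
            out = [o + (g / z) * yi for o, yi in zip(out, y)]
        return out

    def scale_to_zero(self):
        self.warm.clear()  # release all expert memory between request bursts
```

After a few tokens, only the experts the router actually selected are warm; the rest were never materialized, which is the source of the resource savings the abstract claims. Grouping several experts inside one `ExpertFunction` would model the paper's configurable granularity knob: fewer cold starts per request, at the cost of coarser scaling.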