Tying the Loop -- Tied Expert Layers in Mixture-of-Experts Language Models
2026-06-15 • Computation and Language
Computation and Language, Artificial Intelligence, Machine Learning
AI summary
The authors study how to reduce the memory that Mixture-of-Experts (MoE) language models need during training and inference. They introduce Expert Tying, which shares one set of expert parameters across consecutive transformer layers instead of giving each layer its own, while every layer keeps its own routing and attention. In their pretraining experiments, this reduces the memory footprint by almost 2x with virtually no degradation in perplexity or downstream quality. The method exploits parameter redundancy in MoE pathways to offer a favorable compute-to-memory trade-off.
Mixture-of-Experts, Large Language Models, transformer layers, parameter sharing, memory footprint, perplexity, model scaling, routing, attention mechanism, model pretraining
Authors
Martin Jaggi
Abstract
Mixture-of-Experts (MoE) architectures efficiently scale Large Language Models (LLMs) by activating only a small fraction of their experts per token, yet the full parameter count, dominated by the expert parameters, must be held in training and inference memory. To address this, we introduce Expert Tying, an architectural modification that shares expert parameters across consecutive transformer layers while preserving independent, layer-wise routing and attention. We evaluate this approach across common, state-of-the-art architectures, including OLMoE, Qwen3, and DeepSeek-style MoEs. Our pretraining experiments demonstrate that tying experts can reduce memory footprint by almost 2x at virtually no degradation in perplexity or downstream quality. By exploiting the parameter redundancy inherent in MoE pathways, our method provides a highly favorable compute-to-memory trade-off, advancing efficient training and scaling of next-generation LLMs.
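To make the "almost 2x" figure concrete, the following is a minimal parameter-counting sketch, not the authors' implementation. It assumes experts are tied in groups of consecutive layers (pairs, when `tie_group=2`), that each layer keeps its own attention and router, and that experts are gated FFNs with three weight matrices. The class `MoEConfig`, the function `param_counts`, and every dimension below are hypothetical values chosen for illustration; none of them come from the paper. Embeddings and normalization layers are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    # All values are hypothetical and only serve the illustration.
    n_layers: int = 16
    d_model: int = 2048
    d_expert: int = 1024   # hidden width of one expert FFN
    n_experts: int = 64
    tie_group: int = 1     # consecutive layers sharing one expert set (1 = untied)


def param_counts(cfg: MoEConfig) -> dict:
    """Count weights, ignoring embeddings, norms, and biases."""
    # Attention: Q, K, V, and output projections, each d_model x d_model
    # (plain multi-head attention assumed).
    attn = 4 * cfg.d_model ** 2
    # Router: one linear map from d_model to n_experts, kept separate per layer.
    router = cfg.d_model * cfg.n_experts
    # One expert set: n_experts gated FFNs with three d_model x d_expert
    # matrices each (an assumption about the expert shape).
    expert_set = cfg.n_experts * 3 * cfg.d_model * cfg.d_expert
    # Distinct expert sets held in memory: ceil(n_layers / tie_group).
    n_sets = -(-cfg.n_layers // cfg.tie_group)

    per_layer = cfg.n_layers * (attn + router)  # never shared
    experts = n_sets * expert_set               # shared within each tie group
    return {"per_layer": per_layer, "experts": experts,
            "total": per_layer + experts}


untied = param_counts(MoEConfig(tie_group=1))
tied = param_counts(MoEConfig(tie_group=2))
print(f"untied total: {untied['total']:,}")
print(f"tied total:   {tied['total']:,}")
print(f"reduction:    {untied['total'] / tied['total']:.2f}x")
```

Under these assumed dimensions the expert weights halve exactly, while the attention and router weights are unchanged, so the overall reduction lands slightly below 2x. How close it gets to 2x depends on how strongly the experts dominate the parameter count in a given architecture.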