Scheduling Parallel Optical Circuit Switches for AI Training
2026-03-07 • Networking and Internet Architecture
Networking and Internet Architecture, Artificial Intelligence
AI summary
The authors study how to schedule datacenter traffic across multiple parallel optical circuit switches (OCSes), which are energy-efficient but incur a non-negligible delay each time they are reconfigured. Their algorithm, Spectra, decomposes the traffic demand matrix into weighted permutations, assigns these permutations to the switches in a load-aware way, and then equalizes the switch loads by splitting permutations, with the goal of minimizing the makespan (total completion time). On realistic AI training workloads (a GPT model and Qwen MoE expert routing), Spectra reduces the makespan by average factors of 1.4× and 1.9×, respectively, relative to a baseline built from state-of-the-art algorithms, and by 2.4× on standard benchmarks, while consistently approaching newly derived lower bounds. This approach helps improve the efficiency of AI training by reducing traffic delays in optical networks.
Optical Circuit Switch (OCS), Datacenter Traffic, AI Training Workloads, Scheduling Algorithm, Reconfiguration Delay, Permutation Decomposition, Makespan Minimization, Load Balancing, GPT Model, Mixture of Experts (MoE)
Authors
Kevin Liang, Litao Qiao, Isaac Keslassy, Bill Lin
Abstract
The rapid growth of AI training has dramatically increased datacenter traffic demand and energy consumption, motivating renewed interest in optical circuit switches (OCSes) as a high-bandwidth, energy-efficient alternative for AI fabrics. Deploying multiple OCSes in parallel is a leading design option. However, efficiently scheduling time-varying traffic matrices across parallel optical switches with non-negligible reconfiguration delays remains an open challenge. We consider the problem of scheduling a single AI traffic demand matrix $D$ over $s$ parallel OCSes while minimizing the makespan under reconfiguration delay $\delta$. Our algorithm, Spectra, relies on a three-step approach: (1) Decompose $D$ into a minimal set of weighted permutations; (2) Schedule these permutations across the parallel switches using load-aware assignment; and (3) Equalize the imbalanced loads on the switches via controlled permutation splitting. Evaluated on realistic AI training workloads (GPT model and Qwen MoE expert routing) as well as standard benchmarks, Spectra vastly outperforms a baseline based on state-of-the-art algorithms, reducing schedule makespan by an average factor of $1.4\times$ on GPT AI workloads, $1.9\times$ on MoE AI workloads, and $2.4\times$ on standard benchmarks. Further, the makespans achieved by Spectra consistently approach newly derived lower bounds.
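To make the Decompose, Schedule, Equalize pipeline concrete, here is a minimal Python sketch of the problem structure. It is not the authors' Spectra implementation: the greedy matching-based decomposition (which does not guarantee a minimal set), the longest-first assignment rule, and the single-split equalization rule are all stand-ins chosen for brevity, and the function names are hypothetical. The cost model is also an assumption: each permutation placed on a switch is taken to cost one reconfiguration delay `delta` plus its weight, a switch's load is the sum of these costs, and the makespan is the largest load.

```python
def decompose(D):
    """Greedy stand-in: peel (possibly partial) permutations off the support of D."""
    n = len(D)
    R = [row[:] for row in D]  # residual demand; D itself is left untouched
    perms = []
    while any(R[i][j] > 0 for i in range(n) for j in range(n)):
        match = [-1] * n  # match[j] = source currently matched to destination j

        def augment(i, seen):
            # Kuhn's augmenting-path step for bipartite matching.
            for j in range(n):
                if R[i][j] > 0 and not seen[j]:
                    seen[j] = True
                    if match[j] == -1 or augment(match[j], seen):
                        match[j] = i
                        return True
            return False

        for i in range(n):
            augment(i, [False] * n)
        pairs = [(match[j], j) for j in range(n) if match[j] != -1]
        w = min(R[i][j] for i, j in pairs)  # weight that zeroes at least one entry
        for i, j in pairs:
            R[i][j] -= w
        perms.append((w, pairs))
    return perms


def schedule(perms, s, delta):
    """Assumed rule: heaviest permutation first, onto the least-loaded switch."""
    loads = [0.0] * s
    plan = [[] for _ in range(s)]
    for w, pairs in sorted(perms, key=lambda p: -p[0]):
        k = loads.index(min(loads))
        plan[k].append((w, pairs))
        loads[k] += delta + w  # assumed cost: one reconfiguration plus hold time
    return plan, loads


def equalize(plan, loads, delta, rounds=10):
    """Assumed rule: shift part of the heaviest permutation from the most
    loaded switch to the least loaded one, paying an extra delta there."""
    for _ in range(rounds):
        hi, lo = loads.index(max(loads)), loads.index(min(loads))
        gap = loads[hi] - loads[lo]
        if gap <= delta:
            break  # a split adds a reconfiguration, so it can no longer help
        idx = max(range(len(plan[hi])), key=lambda t: plan[hi][t][0])
        w, pairs = plan[hi][idx]
        x = min(w, (gap - delta) / 2)  # amount that balances the two switches
        if x < w:
            plan[hi][idx] = (w - x, pairs)
            loads[hi] -= x
        else:  # the whole permutation moves, taking its reconfiguration with it
            plan[hi].pop(idx)
            loads[hi] -= delta + w
        plan[lo].append((x, pairs))
        loads[lo] += delta + x
    return plan, loads


# Toy 3x3 demand matrix, s = 2 switches, delta = 1 (illustrative values only).
D = [[0, 8, 2],
     [2, 0, 8],
     [8, 2, 0]]
perms = decompose(D)
plan, loads = schedule(perms, s=2, delta=1)
before = max(loads)
plan, loads = equalize(plan, loads, delta=1)
print("makespan before equalizing:", before)     # 9.0
print("makespan after equalizing:", max(loads))  # 6.5
```

In this toy instance the demand splits into two permutations of weights 8 and 2, so the initial assignment leaves the switches at loads 9 and 3. Moving 2.5 units of the heavier permutation to the lighter switch costs one extra reconfiguration there and brings both switches to 6.5, which illustrates why splitting can pay off when the load gap exceeds `delta`.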