Don't Let a Few Network Failures Slow the Entire AllReduce
2026-06-01 • Distributed, Parallel, and Cluster Computing
Distributed, Parallel, and Cluster Computing • Machine Learning • Networking and Internet Architecture
AI summary
The authors study how network failures slow collective communication between GPUs during training. They derive a lower bound on AllReduce completion time when one server has reduced network bandwidth, and show that the unavoidable slowdown is small, O(1/p) for p GPUs, as long as that server retains at least half of its original bandwidth. To approach this bound, they design OptCC, a four-stage pipelined AllReduce algorithm. In SimAI experiments with up to 50% bandwidth loss, OptCC completes AllReduce within 2-6% of NCCL's fault-free ring performance, whereas the state-of-the-art incurs up to 57% overhead.
GPU clusters, AllReduce, collective communication, network failures, bandwidth asymmetry, NCCL, pipelining, fault tolerance, ring algorithm
Authors
Peiqing Chen, Jiedong Jiang, Nengneng Yu, Yuefeng Wang, Sixian Xiong, Wei Wang, Zaoxing Liu
Abstract
Network failures are among the most frequent hardware faults in large-scale GPU clusters and a leading cause of training-job interruptions. Modern collective communication libraries such as NCCL mitigate network failures by rerouting traffic through surviving NICs on the same server, trading reduced inter-node bandwidth for uninterrupted training. However, the degraded server remains on the critical path of the standard ring algorithm, slowing the entire collective. We present the first information-theoretic lower bound on AllReduce completion time under asymmetric network bandwidth and show that when the straggler retains at least half of its original bandwidth, the unavoidable overhead relative to the fault-free optimum is only O(1/p) for p GPUs. We then design OptCC, a four-stage pipelined AllReduce algorithm that approaches this lower bound. Experiments on SimAI confirm that OptCC closes the gap left by existing fault-tolerant schemes: under practical network failures with up to 50% bandwidth loss, OptCC completes AllReduce within 2-6% of NCCL's fault-free ring performance, whereas the state-of-the-art incurs up to 57% overhead.
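To illustrate why a single degraded server sits on the critical path of a ring, the sketch below uses the textbook bandwidth-only cost model for ring AllReduce: 2(p-1) steps, each moving a chunk of size S/p over every link, with each step gated by the slowest link. This is an assumption made for illustration. It is not the paper's lower bound, not OptCC, and not a model of NCCL's rerouting. The function names and the default message size and link bandwidth are hypothetical.

```python
def ring_allreduce_time(size_bytes, p, bandwidths):
    """Bandwidth-only completion time of a plain ring AllReduce.

    Assumed model: reduce-scatter plus all-gather take 2*(p-1) steps,
    every step sends size_bytes/p over each link at once, and a step
    finishes only when the slowest link finishes. Latency is ignored.
    """
    if p < 2 or len(bandwidths) != p:
        raise ValueError("need p >= 2 and one bandwidth per rank")
    chunk = size_bytes / p
    steps = 2 * (p - 1)
    return steps * chunk / min(bandwidths)


def straggler_overhead(p, fraction, size_bytes=1 << 30, bandwidth=50e9):
    """Relative slowdown of the plain ring when one rank keeps only
    `fraction` of its bandwidth (hypothetical default size and speed)."""
    healthy = ring_allreduce_time(size_bytes, p, [bandwidth] * p)
    degraded_links = [bandwidth * fraction] + [bandwidth] * (p - 1)
    degraded = ring_allreduce_time(size_bytes, p, degraded_links)
    return degraded / healthy - 1.0


if __name__ == "__main__":
    for p in (8, 64, 512):
        print(p, straggler_overhead(p, 0.5))
```

In this model the overhead of an unmodified ring is 1/f - 1 for a straggler that retains a fraction f of its bandwidth, regardless of p, so a 50% loss doubles the completion time. That contrasts with the O(1/p) unavoidable overhead stated in the abstract, which is the gap a straggler-aware algorithm can aim to close.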