Unveiling High-Probability Generalization in Decentralized SGD

2026-05-11 · Machine Learning

AI summary

The authors study decentralized stochastic gradient descent (D-SGD), a method for training machine learning models across many computers at once. They observe that existing generalization guarantees for D-SGD hold only in expectation and degrade badly as the required confidence level grows, making them weaker than the high-probability guarantees already known for ordinary single-worker SGD. To fix this, they develop a new theory that gives high-confidence guarantees for D-SGD, matching the best-known rates for the simpler single-worker case. Their analysis covers convex, strongly convex, and non-convex loss functions, and also accounts for communication overhead between workers. Overall, the results give a clearer picture of how well D-SGD generalizes in real-world distributed settings.
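
To see concretely what is gained, compare the two dependences on the confidence parameter $\delta$ at a representative confidence level; the numbers below are an illustrative calculation, not a result from the paper:

$$\underbrace{\mathcal{O}\!\left(\frac{1}{\delta\sqrt{mn}}\right)}_{\text{in-expectation rate}} \quad\text{vs.}\quad \underbrace{\mathcal{O}\!\left(\frac{\log(1/\delta)}{\sqrt{mn}}\right)}_{\text{high-probability rate}}, \qquad \delta = 0.01:\ \frac{1}{\delta} = 100, \quad \log\frac{1}{\delta} \approx 4.6.$$

As $\delta \to 0$, the first factor grows polynomially while the second grows only logarithmically, which is exactly the gap the paper sets out to close.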

Decentralized Stochastic Gradient Descent, Generalization Bounds, High-Probability Guarantees, Uniform Stability, Convex Optimization, Strongly Convex, Non-Convex Optimization, Excess Risk, Distributed Learning, Communication Overhead
Authors
Jiahuan Wang, Ping Luo, Ziqing Wen, Dongsheng Li, Tao Sun
Abstract
Decentralized stochastic gradient descent (D-SGD) is an efficient method for large-scale distributed learning. Existing generalization studies mainly address expected results, achieving rates limited to $\mathcal{O}\left(\frac{1}{\delta\sqrt{mn}}\right)$, where $\delta$ is the confidence parameter, $m$ the number of workers, and $n$ the sample size. When $m=1$, D-SGD reduces to traditional SGD, whose optimal high-probability generalization bound is $\mathcal{O}\left(\frac{1}{\sqrt{n}}\log(1/\delta)\right)$. This discrepancy reveals a gap between high-probability guarantees for SGD and those for D-SGD. To close this gap, we develop a high-probability learning theory for D-SGD, aiming for the optimal $\mathcal{O}\left(\frac{1}{\sqrt{mn}}\log(1/\delta)\right)$ rate. We refine bounds for D-SGD using pointwise uniform stability in distributed learning, a weaker notion than uniform stability, and analyze them across convex, strongly convex, and non-convex settings. We also provide high-probability results for gradient-based measures in non-convex cases where only local minima exist, and derive optimization error and excess risk bounds. Finally, accounting for communication overhead, we analyze generalization bounds for local models within time-varying frameworks.
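
For readers who have not seen the algorithm, below is a minimal sketch of D-SGD on a toy least-squares problem, assuming a fixed ring topology with a doubly stochastic gossip matrix. The function names (`ring_mixing_matrix`, `dsgd`), the quadratic loss, and all hyperparameters are illustrative choices, not the paper's setup:

```python
import numpy as np

def ring_mixing_matrix(m):
    """Doubly stochastic gossip matrix for a ring of m workers:
    each worker mixes equally with itself and its two neighbors."""
    W = np.zeros((m, m))
    for i in range(m):
        W[i, i] = W[i, (i - 1) % m] = W[i, (i + 1) % m] = 1.0 / 3.0
    return W

def dsgd(features, labels, lr=0.05, steps=500, seed=0):
    """Minimal D-SGD on a least-squares toy problem (illustrative only).

    features[i], labels[i] hold worker i's local data. Each step, every
    worker (1) takes a stochastic gradient step on one of its own samples
    and (2) gossip-averages its iterate with its neighbors via W.
    """
    rng = np.random.default_rng(seed)
    m = len(features)
    d = features[0].shape[1]
    W = ring_mixing_matrix(m)
    w = np.zeros((m, d))                       # one local model per worker
    for _ in range(steps):
        new_w = np.empty_like(w)
        for i in range(m):
            j = rng.integers(len(labels[i]))   # draw one local sample
            x, y = features[i][j], labels[i][j]
            grad = 2.0 * (w[i] @ x - y) * x    # gradient of (w.x - y)^2
            new_w[i] = w[i] - lr * grad        # local SGD step
        w = W @ new_w                          # decentralized averaging step
    return w.mean(axis=0)                      # consensus model

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    m, n, d = 8, 50, 5                         # workers, samples/worker, dim
    w_true = rng.normal(size=d)
    features = [rng.normal(size=(n, d)) for _ in range(m)]
    labels = [X @ w_true + 0.01 * rng.normal(size=n) for X in features]
    w_hat = dsgd(features, labels)
    print("estimation error:", np.linalg.norm(w_hat - w_true))
```

Each worker holds its own model and data; the only coordination is the neighbor-averaging step `w = W @ new_w`. The time-varying frameworks mentioned in the abstract correspond to letting the mixing matrix change across iterations, of which the fixed ring above is the simplest static case.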