Demystifying Pipeline Parallelism: First Theory for PipeDream

2026-06-02 • Machine Learning

Machine Learning; Distributed, Parallel, and Cluster Computing

AI summary

The authors study pipeline model parallelism, which splits a model into stages so that training computation can be spread across many devices. They introduce Randomized PipeDream, a theoretical abstraction of PipeDream, and use it to prove a convergence guarantee for nonconvex problems. They analyze how the delay in the pipeline grows with the number of stages and compare the approach with LocalSGD, which periodically averages models and so avoids stale weights at the cost of synchronization waiting time. In their simulated-time experiments, which method performs better depends on the training objective and the number of stages.

pipeline model parallelism, data parallelism, tensor parallelism, stale gradient, nonconvex optimization, PipeDream, Randomized PipeDream, LocalSGD, model averaging, staleness

Authors

Ivan Ilin, Peter Richtárik

Abstract

Training modern machine learning models increasingly requires computation to be distributed across many accelerators. Data parallelism remains the default choice and is often paired with tensor-parallel sharding, but model parallelism becomes unavoidable once parameters, activations, or optimizer states no longer fit on a single device. This paper studies pipeline model parallelism through the lens of PipeDream (PD) (Harlap et al., 2018). Our first contribution is theoretical: we introduce Randomized PipeDream (RPD), a stale block-SGD abstraction that yields, to our knowledge, the first clean nonconvex convergence guarantee for a PD-style method. Our second contribution is a scaling diagnosis: we prove that the delay induced by steady-state PD grows as $S^2 - S/2 + O(1)$ for $S$ stages, so the stale-read contribution in the convergence theorem scales as $\Theta(\gamma^2 S^4)$, equivalently as $\Theta(S^4/K)$ in the tuned-rate form. Our third contribution is a comparison with LocalSGD, whose periodic model averaging trades weight staleness for synchronization bubbles. In our reported simulated-time experiments, the better-performing method depends on the objective: PD performs better on the quadratic objective and on a small language-modeling training-loss task, while for logistic regression LocalSGD becomes superior as the number of stages increases.
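
The step from the delay bound to the $S^4$ scaling can be sketched as follows. This is an illustrative reading of the abstract, not the paper's derivation. It rests on two assumptions that the abstract does not state: that the stale-read term is proportional to $\gamma^2 \tau^2$, where $\tau$ (a symbol introduced here) denotes the delay, and that the tuned step size satisfies $\gamma \propto 1/\sqrt{K}$, with $K$ taken to be the number of iterations.

$$
\begin{aligned}
\tau(S) &= S^2 - \tfrac{S}{2} + O(1), \\
\gamma^2 \tau(S)^2 &= \gamma^2 \left( S^4 - S^3 + O(S^2) \right) = \Theta(\gamma^2 S^4), \\
\gamma \propto \tfrac{1}{\sqrt{K}} \;&\Longrightarrow\; \gamma^2 \tau(S)^2 = \Theta\!\left( \tfrac{S^4}{K} \right).
\end{aligned}
$$

Under these assumptions, doubling the number of stages multiplies the stale-read term by roughly 16 at a fixed $K$, which suggests why staleness can come to dominate as the pipeline deepens.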
