Piper: A Programmable Distributed Training System

2026-06-09 • Distributed, Parallel, and Cluster Computing

Distributed, Parallel, and Cluster Computing • Artificial Intelligence

AI summary

The authors present Piper, a system that helps train large AI models by letting users easily combine different ways to split work across computers. Instead of manually designing and coding these strategies, Piper lets users specify them simply through annotations and scheduling instructions. Piper translates these into a unified plan, then runs it without needing to change the core system for each new strategy. The authors show that Piper works as well as existing methods like ZeRO and can also improve efficiency by combining multiple parallelism approaches.

data parallelism, pipeline parallelism, expert parallelism, ZeRO optimization, distributed training, intermediate representation (IR), training DAG, runtime system, model parallelism, scheduling

Authors

Megan Frisella, Shubham Tiwari, Andy Ruan, Yi Pan, Parker Gustafson, Mat Jacob, Gilbert Bernstein, Stephanie Wang

Abstract

Large-scale model training increasingly relies on composing multiple parallelism strategies, such as data, pipeline, and expert parallelism, together with memory-saving optimizations like ZeRO. Deployed systems for foundation model pretraining often rely on human experts to manually design a high-level parallelism strategy and then implement the corresponding low-level execution strategy, making it difficult to adapt the system to new strategies. Meanwhile, many general-purpose frameworks are more flexible, but their implementations are still tied to a fixed set of common parallelism strategies, making it challenging to integrate state-of-the-art strategies. We present Piper, a user-controllable distributed training system that decouples the strategy from the runtime implementation. Piper allows users to declare a comprehensive distributed training strategy with a small set of model annotations and scheduling directives. Each directive applies a transformation on Piper's intermediate representation (IR), a unified global training DAG that represents all computation and communication. Using this IR, Piper compiles per-device execution plans and executes them with a distributed runtime agnostic to the strategy. We show that the combined system maintains performance parity on commonly available strategies such as ZeRO, while also enabling additional performance and memory efficiency gains through joint scheduling of compute and communication in composed parallelism strategies such as DeepSeek-V3's DualPipe.
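
The flow the abstract describes (directives that transform a global training DAG, followed by compilation into per-device plans) can be pictured with a toy sketch. This is not Piper's API: the names `TrainingDAG`, `place`, `insert_transfers`, and `compile_plans` are hypothetical, and the two-stage pipeline is an invented example. It only illustrates the general idea of keeping compute and communication in one graph and lowering that graph per device.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Optional


@dataclass
class Node:
    name: str
    kind: str                          # "compute" or "comm"
    deps: list = field(default_factory=list)
    device: Optional[int] = None       # set by a placement directive


class TrainingDAG:
    """Hypothetical stand-in for a global training DAG (not Piper's IR)."""

    def __init__(self):
        self.nodes = {}

    def add(self, name, kind, deps=(), device=None):
        self.nodes[name] = Node(name, kind, list(deps), device)
        return name


def place(dag, assignment):
    """Hypothetical directive: pin each named node to a device."""
    for name, device in assignment.items():
        dag.nodes[name].device = device
    return dag


def insert_transfers(dag):
    """Hypothetical directive: make cross-device edges explicit as send/recv nodes."""
    for node in list(dag.nodes.values()):      # copy: we add nodes while iterating
        new_deps = []
        for dep in node.deps:
            src = dag.nodes[dep]
            crosses = src.kind == node.kind == "compute" and src.device != node.device
            if crosses:
                send = dag.add(f"send:{dep}->{node.name}", "comm", [dep], src.device)
                recv = dag.add(f"recv:{dep}->{node.name}", "comm", [send], node.device)
                new_deps.append(recv)
            else:
                new_deps.append(dep)
        node.deps = new_deps
    return dag


def compile_plans(dag):
    """Lower the global DAG to one dependency-ordered op list per device."""
    graph = {n.name: n.deps for n in dag.nodes.values()}
    plans = defaultdict(list)
    for name in TopologicalSorter(graph).static_order():
        plans[dag.nodes[name].device].append(name)
    return dict(plans)


# Invented example: a two-stage pipeline with one forward and one backward pass.
dag = TrainingDAG()
dag.add("fwd0", "compute")
dag.add("fwd1", "compute", ["fwd0"])
dag.add("bwd1", "compute", ["fwd1"])
dag.add("bwd0", "compute", ["bwd1"])

place(dag, {"fwd0": 0, "bwd0": 0, "fwd1": 1, "bwd1": 1})
insert_transfers(dag)
plans = compile_plans(dag)

for device, ops in sorted(plans.items()):
    print(device, ops)
```

In this sketch the runtime would only need to walk each device's list in order, which is one way to read "agnostic to the strategy": changing the placement or adding another transformation changes the plans, not the executor.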
