DisagFusion: Asynchronous Pipeline Parallelism and Elastic Scheduling for Disaggregated Diffusion Serving

2026-05-25 • Distributed, Parallel, and Cluster Computing


AI summary

The authors address the problem of serving large diffusion-based image generation models, whose weights often exceed the memory of a single GPU and whose stages place uneven demands on compute and memory. They propose DisagFusion, a system that runs the encoder, diffusion transformer (DiT), and decoder stages as separate services on different GPUs and rebalances instances across stages to keep the pipeline efficient. Overlapping computation with stage-to-stage communication reduces waiting time between stages, and the scheduler adapts to changing workloads. In their evaluation against a monolithic baseline, DisagFusion improves throughput by 3.4x to 20.5x and reduces end-to-end latency by 18.5x.

diffusion models, pipeline parallelism, GPU memory, model serving, asynchronous execution, scheduling, heterogeneous GPUs, throughput, latency, performance optimization

Authors

Hantian Zha, Teng Ma, Yang Yong, Haiwen Fu, Ruiyang Ma, Wei Gao, Ruihao Gong, Xianglong Liu, Wei Wang, Yunpeng Chai

Abstract

Diffusion-based generation is increasingly powering production content pipelines; however, deploying these models at scale remains a significant challenge. Model weights frequently exceed the memory capacity of commodity GPUs, while the encoder, diffusion transformer (DiT), and decoder stages exhibit highly imbalanced computational and memory footprints. A natural remedy is disaggregated serving, in which stages run as separate services on heterogeneous GPUs, yet this introduces new bottlenecks, including stage handoff overheads and fast-changing workloads that make cross-stage provisioning and scheduling brittle. This paper presents DisagFusion, which enables asynchronous pipeline parallelism and elastic scheduling for disaggregated diffusion serving. First, DisagFusion introduces asynchronous pipeline parallelism that overlaps computation and stage-to-stage communication to reduce pipeline bubbles and mitigate network jitter. Second, DisagFusion employs a hybrid instance scheduling strategy that combines lightweight performance prediction with runtime feedback to continuously rebalance the instance ratio across stages under workload shifts. We implement DisagFusion and evaluate it with modern diffusion models. Compared to a monolithic baseline, DisagFusion improves throughput by 3.4x-20.5x and reduces end-to-end latency by 18.5x, while enabling flexible, cost-efficient deployment across heterogeneous GPUs.
