Factored Gossip DiLoCo: Reducing Blocking Communication in DiLoCo

2026-06-22 • Machine Learning

Machine Learning; Distributed, Parallel, and Cluster Computing

AI summary

The authors address the problem of slow and fragile synchronization in distributed training of very large language models, especially when network bandwidth is low. They build on an existing method called DiLoCo by replacing its exact synchronization with an approximate one, called mixing or gossip, which is less affected by delays and failures. Their approach splits synchronization into two parts: a non-blocking step that overlaps with computation, and a blocking step that occasionally tightens agreement between workers. This results in better use of computing resources and comparable training progress, even when network conditions are poor or failures happen.

distributed training, language models, synchronization, DiLoCo, mixing, gossip protocol, compute utilization, stragglers, bandwidth, optimization stability

Authors

Chamin Hewa Koneputugodage, Thalaiyasingam Ajanthan, Sameera Ramasinghe, Hadi Mohaghegh Dolatabadi, Shamane Siriwardhana, Gil Avraham, Violetta Shevchenko, Karol Pajak, James Snewin, Alexander Long

Abstract

To make large-scale distributed training practical outside high-bandwidth datacenters, we must reduce blocking, high-volume synchronization. While DiLoCo communicates infrequently, its outer synchronization remains bandwidth-heavy and brittle to stragglers and transient failures. We relax exact synchronization to approximate synchronization via mixing/gossip, which degrades gracefully under delays and communication failures. This allows us to factorize DiLoCo synchronization into a non-blocking mixing step that overlaps computation with no staleness, and a blocking mixing step that tightens worker agreement, yielding a tunable trade-off between compute utilization and optimization stability. On up to billion-parameter language models in low-bandwidth settings, our framework substantially improves compute utilization compared to DiLoCo, with training progress ranging from comparable to closely matching it, and is more robust to failures.
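
To illustrate why approximate synchronization via gossip can degrade gracefully, the following is a minimal sketch of generic pairwise gossip averaging in plain Python. It is not the paper's algorithm: the function names (`gossip_round`, `mean_vector`, `disagreement`), the random pairwise matching, and the `drop_prob` failure model are all assumptions made for illustration. It only shows the general property that pairwise averaging preserves the mean across workers and never increases their disagreement, even when some exchanges fail.

```python
import random


def gossip_round(params, rng, drop_prob=0.0):
    """One round of pairwise gossip averaging (illustrative, hypothetical helper).

    params:    list of per-worker parameter vectors (lists of floats).
    rng:       a random.Random instance used for the matching and for failures.
    drop_prob: assumed probability that a pair's exchange fails.
    """
    order = list(range(len(params)))
    rng.shuffle(order)  # random matching of workers into disjoint pairs
    new = [list(p) for p in params]
    for i, j in zip(order[0::2], order[1::2]):
        if rng.random() < drop_prob:
            continue  # failed link: both workers keep their current values
        avg = [(a + b) / 2.0 for a, b in zip(params[i], params[j])]
        new[i], new[j] = avg, list(avg)
    return new


def mean_vector(params):
    """Coordinate-wise mean over all workers."""
    n = len(params)
    return [sum(col) / n for col in zip(*params)]


def disagreement(params):
    """Sum of squared distances of every worker from the mean."""
    m = mean_vector(params)
    return sum((x - mu) ** 2 for p in params for x, mu in zip(p, m))


if __name__ == "__main__":
    rng = random.Random(0)
    workers = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
    print("initial disagreement:", disagreement(workers))
    for _ in range(10):
        # 30% of exchanges fail, yet the workers still drift toward agreement.
        workers = gossip_round(workers, rng, drop_prob=0.3)
    print("after 10 lossy rounds:", disagreement(workers))
```

In this toy setting a failed exchange only slows the contraction toward consensus, whereas an exact all-reduce would have to wait for or retry the failed participant. How the paper splits mixing into non-blocking and blocking steps is not specified in the abstract and is not modeled here.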
