Mixtures of Subspaces for Bandwidth Efficient Context Parallel Training

2026-06-15 | Machine Learning

AI summary

The authors studied how to train language models on very long text inputs more efficiently when the computers involved are connected by slow network links. The usual approach splits each input across devices and exchanges data between them, which needs more bandwidth than such links can provide. To fix this, they created a method that compresses the communication by over 95% with negligible overhead and no loss in convergence. It works by exploiting the low-rank structure of activation outputs, so the important information fits into much smaller pieces that are cheaper to share. Their method allows billion-parameter models to handle context lengths over 100,000 tokens on networks as slow as 300 Mbps, converging as fast in wall-clock time as centralized models trained on 100 Gbps interconnects.

language models, extended context windows, decentralized training, communication compression, low-rank structure, activation outputs, reparameterization, context parallelism, model convergence, bandwidth efficiency
Authors
Sameera Ramasinghe, Ajanthan Thalaiyasingam, Hadi Mohaghegh Dolatabadi, Gil Avraham, Violetta Shevchenko, Yan Zuo, Chamin Hewa Koneputugodage, Alexander Long
Abstract
Pretraining language models with extended context windows enhances their ability to leverage rich information during generation. Existing methods split input sequences into chunks, broadcast them across multiple devices, and compute attention block by block, which incurs significant communication overhead. While feasible in high-speed clusters, these methods are impractical for decentralized training over low-bandwidth connections. We propose a compression method for communication-efficient context parallelism in decentralized settings, achieving a compression rate of over 95% with negligible overhead and no loss in convergence. Our key insight is to exploit the intrinsic low-rank structure of activation outputs by dynamically constraining them to learned mixtures of subspaces via efficient reparameterizations. We demonstrate scaling billion-parameter decentralized models to context lengths exceeding 100K tokens on networks as slow as 300 Mbps, matching the wall-clock convergence speed of centralized models on 100 Gbps interconnects.
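To give a rough sense of the bandwidth figures in the abstract, the following back-of-the-envelope sketch compares the traffic for one activation block sent at full width against the same block sent as low-rank coefficients. It is not the authors' method. The chunk length, activation width, rank, and 2-byte values are hypothetical numbers chosen for illustration, and the helper names are invented here. Only the 300 Mbps and 100 Gbps link speeds and the 95% target come from the abstract.

```python
# Illustrative arithmetic only. All sizes are hypothetical assumptions;
# only the link speeds and the 95% compression target come from the abstract.

def payload_bytes(tokens: int, width: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to send one activation block of shape (tokens, width)."""
    return tokens * width * bytes_per_value


def transfer_seconds(num_bytes: float, link_bits_per_second: float) -> float:
    """Idealised transfer time; ignores latency and protocol overhead."""
    return num_bytes * 8 / link_bits_per_second


def compression_rate(full_width: int, rank: int) -> float:
    """Fraction of traffic saved by sending `rank` coefficients per token
    instead of `full_width` values (assumes the receiver already holds
    the basis, so only coefficients cross the network)."""
    return 1.0 - rank / full_width


TOKENS_PER_CHUNK = 8192   # hypothetical chunk length held by one device
HIDDEN_WIDTH = 4096       # hypothetical activation width
RANK = 128                # hypothetical subspace dimension

SLOW_LINK = 300e6         # 300 Mbps, from the abstract
FAST_LINK = 100e9         # 100 Gbps, from the abstract

full = payload_bytes(TOKENS_PER_CHUNK, HIDDEN_WIDTH)
small = payload_bytes(TOKENS_PER_CHUNK, RANK)
rate = compression_rate(HIDDEN_WIDTH, RANK)

slow_full = transfer_seconds(full, SLOW_LINK)
slow_small = transfer_seconds(small, SLOW_LINK)
fast_full = transfer_seconds(full, FAST_LINK)

print(f"compression rate:            {rate:.2%}")
print(f"full block at 300 Mbps:      {slow_full:.4f} s")
print(f"low-rank block at 300 Mbps:  {slow_small:.4f} s")
print(f"full block at 100 Gbps:      {fast_full:.4f} s")
```

Under these assumed sizes, a rank of 128 out of 4096 clears the 95% threshold. The sketch only illustrates the scale of the traffic reduction. It says nothing about how the reported wall-clock parity is reached, which also depends on factors such as how communication overlaps with computation.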