FlowTrain: Flow-Based Decoupled Training for Industrial-Grade Vision-Language Models
2026-06-22 • Machine Learning
AI summary
The authors examine why training vision-language models (which combine images and text) is less efficient than training language-only models. They propose FlowTrain, a training framework that treats the process as a producer-consumer dataflow: the encoder and the backbone run independently and stay connected through a unified memory pool. This lets the system assign each module its own parallelism strategy and balance microbatches dynamically at runtime. Their experiments report over 50% MFU and up to 1.7x higher throughput.
vision-language models, distributed training, parallelism, dataflow, encoder, backbone, memory pool, throughput, batch scheduling, model training efficiency
Authors
Zhida Jiang, Zhaolong Xing, Yang Pei, Xiaolong Chen, Yuanhang Xiao, Chengzhi Huang, Xiyu Liu, Haopeng Liu, Qingyuan Sang, Lingfeng Zhou, Jiaxing Wang, Zicheng Zhang, Wenzhe Wang, Xinyu Liu, Yan Li, Zhen Chen, Ke Zhang
Abstract
Industrial-grade distributed training of vision-language models (VLMs) remains far less efficient than that of unimodal LLMs. Existing solutions either follow a monolithic design that assigns uniform parallelism to heterogeneous modules, or adopt a disaggregated deployment that separates the modules but still executes them as a batch-synchronized pipeline. In this paper, we argue that both approaches fall short and that VLM training can be decoupled further. To this end, we present FlowTrain, a flow-based decoupled training framework that reformulates VLM training as a producer-consumer dataflow coordinated through a unified memory pool, so that the encoder and backbone progress independently over a global virtual address space. Because this execution decoupling fundamentally changes the optimization objective of allocation and scheduling, FlowTrain further introduces a heterogeneous parallel allocator that assigns module-specific parallelism strategies by solving a throughput matching problem, and a dynamic packing scheduler that constructs balanced microbatches at runtime according to the actual LLM-side computation cost. Extensive experiments on real-world workloads show that FlowTrain achieves over 50% model FLOPs utilization (MFU) and up to 1.7x throughput improvement, narrowing the efficiency gap to LLM-only training.
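The abstract does not specify how the allocator or the scheduler work internally, so the following is only a minimal sketch of the two ideas under strong assumptions, not FlowTrain's actual method. The names `split_gpus` and `pack_microbatches` are hypothetical. The first function assumes each stage's throughput scales linearly with its GPU count and picks the split that maximizes the slower stage. The second swaps in a generic greedy longest-processing-time heuristic to balance per-sample LLM-side cost estimates across microbatches.

```python
import heapq
from typing import List, Sequence, Tuple


def split_gpus(total_gpus: int,
               enc_rate_per_gpu: float,
               llm_rate_per_gpu: float) -> Tuple[int, int]:
    """Return (encoder_gpus, backbone_gpus) maximizing the slower stage's rate.

    Assumption: throughput (samples/s) scales linearly with GPU count,
    which ignores communication overhead and parallelism-strategy effects.
    """
    if total_gpus < 2:
        raise ValueError("need at least one GPU per stage")
    best, best_rate = (1, total_gpus - 1), -1.0
    for enc in range(1, total_gpus):
        llm = total_gpus - enc
        # In a producer-consumer pipeline the slower stage bounds throughput.
        rate = min(enc * enc_rate_per_gpu, llm * llm_rate_per_gpu)
        if rate > best_rate:
            best, best_rate = (enc, llm), rate
    return best


def pack_microbatches(costs: Sequence[float], num_bins: int) -> List[List[int]]:
    """Greedily assign sample indices to microbatches to balance total cost.

    Samples are visited from most to least expensive, and each one goes to
    the currently lightest microbatch (longest-processing-time heuristic).
    """
    bins: List[List[int]] = [[] for _ in range(num_bins)]
    heap = [(0.0, b) for b in range(num_bins)]  # (current load, bin index)
    heapq.heapify(heap)
    for idx in sorted(range(len(costs)), key=lambda i: -costs[i]):
        load, b = heapq.heappop(heap)
        bins[b].append(idx)
        heapq.heappush(heap, (load + costs[idx], b))
    return bins
```

For example, `split_gpus(8, 4.0, 1.0)` returns `(2, 6)`: with an encoder four times faster per GPU than the backbone, most devices go to the backbone. The greedy packer is a heuristic and does not guarantee an optimal balance. A real scheduler would also need cost estimates that reflect sequence length and attention cost on the LLM side.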