BigMac: Breaking the Pareto Frontier of Compute and Memory in Multimodal LLM Training

2026-05-25

Machine Learning
AI summary

The authors propose BigMac, a training pipeline for multimodal large language models that nests the encoder and generator computation inside the existing LLM pipeline. This keeps the activation memory of the encoder and generator constant without changing the LLM's own activation memory. BigMac matches the computational efficiency of an idealized setting with unlimited memory, so it avoids the usual trade-off between speed and memory efficiency. Experiments show a 1.08x to 1.9x training speedup over baseline systems (roughly 8% to 90% faster), with memory usage staying stable as batch size grows.

Multimodal Large Language Models, Training Pipeline, Memory Efficiency, Computational Efficiency, Activation Memory, Encoder, Generator, Batch Size, Pareto Frontier, Nested Pipeline
Authors
Zili Zhang, Chengxu Yang, Shenglong Zhang, Chenyu Wang, Yufan Zhang, Tuo Dai, Zhouyang Li, Yuhong Ge, Chao Jin, Xin Jin, Yuliang Liu
Abstract
Training multimodal large language models (MLLMs) is challenged by both model and data heterogeneity. Existing systems redesign the training pipeline to address these challenges, but remain bound by a Pareto frontier between compute and memory efficiency, improving one only at the expense of the other. We present BigMac, a new training pipeline for multimodal LLMs. The core idea of BigMac is to nest the encoder and generator computation into the original LLM pipeline, forming a dependency-safe nested pipeline structure. With this design, BigMac reduces the activation memory complexity of the encoder and generator to $O(1)$ while keeping the activation memory complexity of the LLM unchanged. At the same time, it achieves the same computational efficiency as the idealized setting with unlimited memory. As a result, BigMac breaks the Pareto frontier between computational efficiency and memory usage, enabling simultaneous optimization of both computation and memory in MLLM training. We evaluate BigMac on multiple MLLMs and training workloads. Experimental results show that BigMac achieves a 1.08$\times$ to 1.9$\times$ training speedup over baseline systems while maintaining stable memory usage as batch size increases.
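To make the $O(1)$ claim concrete, here is a toy accounting sketch. It is not BigMac's actual schedule, which the abstract does not spell out. It only contrasts two hypothetical orderings: one where the encoder runs for every microbatch before any of its activations can be released, and one where each microbatch's encoder activations are released before the next microbatch's encoder work starts. The function name `peak_encoder_buffers` and the assumption that memory is counted in whole per-microbatch buffers are illustrative choices made here.

```python
def peak_encoder_buffers(num_microbatches: int, nested: bool) -> int:
    """Toy model: peak number of per-microbatch encoder activation
    buffers alive at the same time (hypothetical, not BigMac itself)."""
    live = 0   # buffers currently held
    peak = 0   # maximum held at any point

    if nested:
        # Assumed ordering: each microbatch's encoder activations are
        # released before the next microbatch's encoder forward begins.
        for _ in range(num_microbatches):
            live += 1
            peak = max(peak, live)
            live -= 1
    else:
        # Assumed ordering: all encoder forwards run up front, so every
        # buffer is held until the backward passes start.
        for _ in range(num_microbatches):
            live += 1
            peak = max(peak, live)
        live -= num_microbatches

    return peak


if __name__ == "__main__":
    for m in (1, 4, 16):
        print(m, peak_encoder_buffers(m, nested=False),
              peak_encoder_buffers(m, nested=True))
```

Under these assumptions the up-front ordering grows linearly with the number of microbatches, while the interleaved ordering stays at one buffer. That is the sense in which memory can remain stable as batch size increases.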