Tangram: Hiding GPU Heterogeneity for Efficient LLM Parallelization
2026-06-15 • Distributed, Parallel, and Cluster Computing
AI summary
The authors address the challenge of training large language models on GPU clusters that mix different GPU types and interconnects, which makes planning how to split up the work very complex. They created Tangram, a system that groups GPUs with similar compute, memory, and connectivity into homogeneous sets called islands and splits the model into slices that are assigned to these islands. This allows existing tools, which usually assume all GPUs are the same, to work well even when the GPUs are different. Tangram composes the slices into work-balanced pipelines, achieving up to 2.3x higher training throughput than the heterogeneous parallelizers Metis and Sailor, and scales to large clusters by pruning the plans it enumerates.
Large Language Models (LLM), GPU clusters, Parallelization, Heterogeneous GPUs, Model Partitioning, Expert Parallelism, ZeRO, Work Balancing, Throughput, Pipeline Parallelism
Authors
Yanda Tao, Pedro F. Silvestre, Marcel Wagenländer, Peter Pietzuch
Abstract
The scale of LLM training jobs requires parallelization planning over large GPU clusters. Due to different GPU types and interconnects added over time, these GPU clusters are increasingly heterogeneous. Automatic LLM parallelizers can search for parallelization plans but face an exploding search space with heterogeneous GPUs. To make search tractable in heterogeneous GPU clusters, parallelizers often omit types of parallelism (e.g., expert parallelism) or memory-saving techniques (e.g., ZeRO), which results in worse plans. We describe Tangram, a system that enables the use of existing heterogeneity-unaware LLM parallelizers in heterogeneous GPU clusters by decoupling parallelization planning from GPU heterogeneity. For this, Tangram exploits two insights: (1) since bulk purchases result in sets of GPUs with similar compute, memory, and connectivity, Tangram can expose such homogeneous GPU islands to existing parallelizers; and (2) parallelizers commonly first partition models and then parallelize partitions. Tangram can compose such model slices, assigned to GPU islands, into work-balanced pipelines for high throughput. Tangram integrates with existing parallelizers through a narrow API, which relies on the enumeration of model-slice/island pairs. Tangram achieves up to 2.3x higher training throughput than current heterogeneous parallelizers (Metis and Sailor) and scales to large GPU clusters by pruning enumerated plans.
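The idea of enumerating model-slice/island pairs and composing them into a work-balanced pipeline can be illustrated with a small sketch. This is not Tangram's algorithm or API: the `Island` class, the `stage_time` estimate, and `best_pipeline` are hypothetical names introduced here, and the sketch assumes that each island hosts exactly one contiguous slice of layers and that stage time is simply the slice cost divided by the island's aggregate throughput. In a real system, that estimate would presumably come from the existing parallelizer invoked on each slice/island pair.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Island:
    """Hypothetical homogeneous GPU island: identical GPUs, uniform links."""
    name: str
    gpus: int
    tflops_per_gpu: float

    @property
    def throughput(self) -> float:
        # Assumption: perfect scaling inside an island.
        return self.gpus * self.tflops_per_gpu


def stage_time(layer_costs, start, end, island):
    """Stand-in cost model for running layers [start, end) on one island."""
    return sum(layer_costs[start:end]) / island.throughput


def best_pipeline(layer_costs, islands):
    """Enumerate slice/island pairs and return (bottleneck_time, plan).

    plan is a list of (island_name, start, end) pipeline stages. The
    bottleneck (slowest stage) bounds pipeline throughput, so the most
    work-balanced plan is the one that minimizes it.
    """
    n = len(layer_costs)
    best = (float("inf"), None)

    def extend(order, idx, start, plan, bottleneck):
        nonlocal best
        if bottleneck >= best[0]:
            return  # prune: this partial plan cannot beat the best one
        island = order[idx]
        if idx == len(order) - 1:
            # The last island takes all remaining layers.
            t = max(bottleneck, stage_time(layer_costs, start, n, island))
            if t < best[0]:
                best = (t, plan + [(island.name, start, n)])
            return
        remaining = len(order) - idx - 1  # keep >= 1 layer per later island
        for end in range(start + 1, n - remaining + 1):
            t = stage_time(layer_costs, start, end, island)
            extend(order, idx + 1, end,
                   plan + [(island.name, start, end)], max(bottleneck, t))

    # Try every ordering of islands along the pipeline.
    for order in permutations(islands):
        extend(order, 0, 0, [], 0.0)
    return best
```

For example, with twelve equal-cost layers and two islands whose aggregate throughputs differ by a factor of two, `best_pipeline([1.0] * 12, [Island("A", 2, 1.0), Island("B", 1, 1.0)])` assigns eight layers to the faster island and four to the slower one, so both stages take the same time. The pruning step here is a simple branch-and-bound on the bottleneck; the abstract does not specify how Tangram prunes its enumerated plans.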