Scaling LLM Inference Beyond Amdahl's Limits via Eliminating Non-Scalable Overheads

2026-06-01
Distributed, Parallel, and Cluster Computing

AI summary

The authors study how to get the best performance from large language models served on a fixed number of GPUs. They explain that splitting a model across more GPUs (called tensor parallelism) helps fit big models, but the speedup falls short of linear because of cross-GPU communication and runtime work that does not shrink as GPUs are added. On the other hand, more parallelism uses memory more efficiently and eases pressure on the KV-cache. They identify an optimal degree of parallelism that balances these effects. Their system, Albireo, raises that optimum by overlapping scheduling and I/O with computation and by parallelizing sampling, without changing the models themselves. Tests show Albireo delivers higher throughput, lower latency, and lower energy use than vLLM.

Tensor parallelism, Large language models, GPU utilization, KV-cache, Amdahl's Law, Parallel inference, Throughput, Latency, Energy efficiency, Distributed computing
Authors
Alan Zhao, Cyril Y. He, Wei Xu
Abstract
Deployers of online LLM services usually seek to maximize cluster-wide performance given a fixed number of GPUs. Tensor parallelism (TP) is necessary to fit modern models but scales sub-linearly as the TP degree t grows, due to cross-GPU communication and non-scalable runtime work, as predicted by Amdahl's Law. Conversely, increasing t improves memory efficiency and alleviates KV-cache contention and swapping. We identify and validate an empirical optimal TP degree t_e that balances these effects. We present Albireo, a parallel inference system that raises the attainable t_e by shrinking the non-scalable portion via overlap of scheduling and I/O with compute and sequence-parallel sampling, without changing model architectures. Across models and benchmarks, Albireo achieves up to 1.9x higher throughput, 48% lower latency, 28% higher GPU utilization, and 54% lower energy than vLLM; in production it yields up to 2x higher throughput.
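
The Amdahl's Law argument in the abstract can be illustrated with a minimal sketch. It assumes the textbook form of the law, speedup = 1 / (s + (1 - s) / t), where s is the non-scalable fraction of per-step work and t is the TP degree. The function names and the values of s are hypothetical and chosen only for illustration; they are not measurements from the paper, and the sketch leaves out the memory and KV-cache benefits of larger t that the authors weigh when identifying t_e.

```python
def amdahl_speedup(serial_fraction: float, t: int) -> float:
    """Textbook Amdahl speedup at parallel degree t.

    serial_fraction is the share of work that does not scale with t
    (an illustrative input, not a value reported by the authors).
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if t < 1:
        raise ValueError("t must be >= 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / t)


def per_gpu_efficiency(serial_fraction: float, t: int) -> float:
    """Speedup divided by the number of GPUs used (1.0 means linear scaling)."""
    return amdahl_speedup(serial_fraction, t) / t


if __name__ == "__main__":
    # Hypothetical non-scalable fractions: a larger one and a smaller one.
    for s in (0.20, 0.05):
        for t in (1, 2, 4, 8):
            print(
                f"s={s:.2f} t={t}: "
                f"speedup={amdahl_speedup(s, t):.2f}, "
                f"efficiency={per_gpu_efficiency(s, t):.2f}"
            )
```

Under these assumptions, shrinking s improves per-GPU efficiency at every t greater than 1, which is consistent with the abstract's claim that reducing the non-scalable portion raises the attainable t_e.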