Bandwidth-Aware LLM Inference on Heterogeneous Many-Core Supercomputers

2026-05-25 • Distributed, Parallel, and Cluster Computing


AI summary

The authors address the challenge of running large language models (LLMs) efficiently on heterogeneous many-core processors with limited memory bandwidth, using the MT-3000 processor of the Tianhe supercomputer as an example. They built THInfer, an inference framework that matches software to the hardware in order to reduce data movement and keep data local. THInfer combines a hand-optimized FP16 operator library, computation graph fusion with unified kernel scheduling, and a Prefill-Buffer-Decode pipeline that coordinates multiple processor clusters. On the Llama 7B model it achieves 62 to 84 percent higher throughput than the GPU baselines, it is comparable or better on the 13B and 30B models, and it runs the 70B model stably where typical GPU-based frameworks fail under the same setting. Overall, the authors provide a practical way to run large language models on heterogeneous many-core supercomputers.

Large Language Models (LLM), Inference, Many-core Processors, Memory Bandwidth, Hardware-Software Co-design, VLIW SIMD Architecture, Kernel Optimization, Pipeline Parallelism, MPI (Message Passing Interface), Model Throughput

Authors

Yao Lu, Zhongzhi Luan, Gen Li, Jiaxing Qi, Shiqing Ma, Bin Han, Shizhe Shang, Hailong Yang, Depei Qian

Abstract

Large language model (LLM) inference is limited by high computational cost and memory bandwidth demands, making deployment on heterogeneous many-core processors challenging. The MT-3000 processor used in the Tianhe supercomputer exemplifies these bottlenecks: its limited main-memory bandwidth and distributed memory hierarchy make it difficult to migrate existing GPU-based inference frameworks directly. To address this problem, we propose THInfer, a hardware-aware inference framework that maximizes data locality under bandwidth-constrained conditions through hardware-software co-design and parallel strategy optimization. THInfer incorporates three key techniques: (1) a high-performance operator library for the VLIW SIMD architecture, providing hand-optimized FP16 kernels that achieve up to 70 percent of the peak performance per cluster; (2) a density-driven computation graph fusion and unified kernel scheduling mechanism, combined with a staged pipelined attention fusion method; and (3) a Prefill-Buffer-Decode (P-B-D) pipeline with a bounded buffer management strategy, which supports hybrid parallelism and enables efficient multi-cluster collaboration through two-level communication based on MPI and hthreads. Experiments on the Llama model series show that THInfer improves throughput on the 7B model by 62 percent to 73 percent over DeepSpeed on two V100S GPUs and by 67 percent to 84 percent over the A800 GPU. On the 13B and 30B models, THInfer delivers comparable or better performance. Moreover, THInfer maintains stable performance on the 70B model, whereas typical GPU-based frameworks fail to run under the same setting. Overall, THInfer improves throughput, reduces latency, and scales better, providing a feasible system solution for efficient and scalable LLM inference on heterogeneous many-core architectures.
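The Prefill-Buffer-Decode idea in technique (3) can be illustrated with a generic producer-consumer sketch. This is not THInfer's implementation: the function name `run_pbd`, the parameters `buffer_capacity` and `max_new_tokens`, and the stand-in "KV cache" (a plain token list) are all hypothetical, and Python threads replace the MPI and hthreads communication named in the abstract. The sketch only shows how a bounded buffer lets a prefill stage and a decode stage run concurrently while capping the memory held in flight.

```python
import queue
import threading


def run_pbd(prompts, buffer_capacity=2, max_new_tokens=3):
    """Toy Prefill-Buffer-Decode pipeline (illustrative assumption, not THInfer code)."""
    # Bounded buffer: put() blocks once buffer_capacity items are waiting,
    # so prefill cannot run arbitrarily far ahead of decode.
    buffer = queue.Queue(maxsize=buffer_capacity)
    results = {}
    sentinel = None  # marks the end of the request stream

    def prefill_stage():
        for req_id, prompt in enumerate(prompts):
            # Stand-in for prefill: the "KV cache" is just the prompt tokens.
            kv_cache = prompt.split()
            buffer.put((req_id, kv_cache))
        buffer.put(sentinel)

    def decode_stage():
        while True:
            item = buffer.get()
            if item is sentinel:
                break
            req_id, kv_cache = item
            # Stand-in for autoregressive decoding: emit placeholder tokens.
            generated = [f"tok{i}" for i in range(max_new_tokens)]
            results[req_id] = kv_cache + generated

    producer = threading.Thread(target=prefill_stage)
    consumer = threading.Thread(target=decode_stage)
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results


if __name__ == "__main__":
    for req_id, tokens in sorted(run_pbd(["hello world", "a b c"]).items()):
        print(req_id, tokens)
```

A smaller `buffer_capacity` bounds the number of prefilled requests waiting for decode at the cost of stalling the prefill stage more often; a real system would size this against available memory and the relative speed of the two stages.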
