Parallelizing Large-Scale Tensor Network Contraction on Multiple GPUs
2026-06-01 • Distributed, Parallel, and Cluster Computing
AI summary
The authors study how to speed up simulations of complex quantum problems by improving how calculations on large tensors are split across many GPUs. The usual method, slicing, wastes time on repeated work; their approach instead distributes intermediate tensors across GPUs and plans the communication between them to avoid that redundancy. On 8 GPUs linked by NVLink in a single DGX H100 node, their method delivers a 7 to 173 times extra speedup beyond slicing, and on 1024 GPUs over InfiniBand the extra speedup ranges from 42 to 67,869 times. This communication-aware method lets computers handle very large quantum simulations far more efficiently.
tensor network contraction, quantum circuit simulation, GPU parallelization, slicing, DGX H100, NVLink, GEMM (General Matrix Multiply), InfiniBand, mode reordering, communication-aware scheduling
Authors
Feng Pan, Hanfeng Gu, Paul Springer, Xipeng Li
Abstract
Exact tensor network contraction underpins quantum circuit simulation, quantum error correction, combinatorial optimization, and many-body dynamics. The dominant parallelization strategy, slicing, scales exponentially and incurs redundant computation. We present a multi-GPU framework that instead distributes intermediate tensors across devices with explicit communication, converting a fixed contraction path into a communication-efficient schedule via GEMM-oriented mode reordering and communication-aware mode distribution planning. Within a single DGX H100 node (8 GPUs, NVLink), distribution delivers $7$--$173\times$ extra speedup beyond embarrassingly parallel slicing, capturing nearly all of the available compute reduction (87--101%) because NVLink's high bandwidth keeps communication small relative to compute. Scaling the same four workloads to 1024 H100 GPUs over InfiniBand, the extra speedup beyond slicing ranges from $42\times$ to $67{,}869\times$, demonstrating that communication-aware distributed contraction far surpasses slicing-based scaling limits for frontier tensor networks.
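To make the redundancy of slicing concrete, the following is a minimal NumPy sketch on a toy four-tensor chain. It is not one of the paper's workloads and not the authors' implementation; the function names `contract_full` and `contract_sliced`, the square `d x d` tensors, and the multiply-add cost counting are all assumptions made for illustration. Slicing fixes the value of one shared mode `s` and runs the same contraction path once per value, so any intermediate that does not involve `s` is recomputed in every slice.

```python
import numpy as np

def contract_full(A, B, C, D):
    """Contract the chain A[i,j] B[j,k] C[k,s] D[s,l] along a fixed path.

    Assumes all four tensors are square d x d (illustrative choice).
    Returns the result and the number of scalar multiply-adds.
    """
    d = A.shape[0]
    T1 = A @ B        # modes (i, k), d^3 multiply-adds
    T2 = T1 @ C       # modes (i, s), d^3 multiply-adds
    out = T2 @ D      # modes (i, l), d^3 multiply-adds
    return out, 3 * d**3

def contract_sliced(A, B, C, D):
    """Same path, but with mode s sliced: one independent task per value of s.

    Each task could run on its own GPU with no communication, but every
    task repeats T1 = A @ B, which does not depend on s.
    """
    d = A.shape[0]
    out = np.zeros((d, d))
    cost = 0
    for s in range(d):
        T1 = A @ B                    # redundant: identical in every slice
        cost += d**3
        t2 = T1 @ C[:, s]             # modes (i,), s is fixed
        cost += d**2
        out += np.outer(t2, D[s, :])  # accumulate this slice's contribution
        cost += d**2
    return out, cost

rng = np.random.default_rng(0)
d = 4
A, B, C, D = (rng.standard_normal((d, d)) for _ in range(4))

full, full_cost = contract_full(A, B, C, D)
sliced, sliced_cost = contract_sliced(A, B, C, D)

print(np.allclose(full, sliced))  # both strategies give the same tensor
print(full_cost, sliced_cost)     # sliced cost is higher due to repeated work
```

In this toy case with `d = 4`, the unsliced path costs 192 multiply-adds while the sliced version costs 384, because `A @ B` is recomputed in each of the four slices. Distributing the intermediate tensors across devices, as the abstract describes, avoids this repetition at the price of explicit communication.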