FusionRCG: Orchestrating Recursive Computation Graphs across GPU Memory Hierarchies

2026-05-11 · Distributed, Parallel, and Cluster Computing


Quantum Chemistry, High-dimensional Integrals, Electron Repulsion Integrals, GPU Memory Management, Computation Graphs, Hierarchical Recurrence, SCF (Self-Consistent Field), Parallel Efficiency, Kernel Architecture, Cartesian-to-Spherical Transformation
Authors
Yihong Zhang, Xinran Wei, Junshi Chen, Fusong Ju, Wei Hu, Jinlong Yang, Huanhuan Xia
Abstract
Evaluating high-dimensional integrals via deep hierarchical recurrences is a dominant cost in quantum chemistry. While CPUs manage these efficiently, GPUs suffer a critical mismatch: limited per-thread memory is quickly overwhelmed by an explosion of simultaneously live intermediate variables. As recurrence scales, this forces massive data spilling to global memory, collapsing performance into a severe memory-bound regime. We present FusionRCG, a framework that jointly optimizes computation graph structure and GPU memory mapping. Exploiting the inherent topological flexibility of recurrence graphs, using electron repulsion integrals as an example, we contribute: (1) liveness-aware graph orchestration to minimize peak live intermediates; (2) algebraic dimensionality reduction via stepwise Cartesian-to-spherical fusion, shrinking intermediate footprints by up to $7.7\times$; and (3) an adaptive multi-tier kernel architecture routing graphs across the memory hierarchy. Evaluated on NVIDIA A100 GPUs, FusionRCG achieves up to $3.09\times$ end-to-end SCF speedup over GPU4PySCF and maintains $75\%$ parallel efficiency at 64~GPUs, successfully rescuing these workloads from memory-bound limits.
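The first contribution, liveness-aware graph orchestration, amounts to choosing among the valid topological orders of a recurrence graph the one that minimizes the peak number of simultaneously live intermediates. The sketch below illustrates the idea on a tiny invented DAG (the node names and edges are hypothetical, not the paper's actual ERI recurrence), assuming a value is live from the step that produces it until its last consumer executes; it brute-forces all topological orders, which is tractable only at toy scale.

```python
from itertools import permutations

# Hypothetical toy DAG: node -> set of prerequisite nodes.
# Names are illustrative, not the paper's recurrence terms.
deps = {
    "a": set(), "b": set(),
    "c": {"a"}, "d": {"a", "b"},
    "e": {"c", "d"}, "f": {"d"},
    "g": {"e", "f"},
}

def is_topological(order):
    """True if every node appears after all of its prerequisites."""
    seen = set()
    for n in order:
        if not deps[n] <= seen:
            return False
        seen.add(n)
    return True

def peak_live(order):
    """Peak count of simultaneously live values under this schedule.

    A value is live from its production step until the step of its
    last consumer (values with no consumers die immediately after use).
    """
    consumers = {n: {m for m, ps in deps.items() if n in ps} for n in deps}
    pos = {n: i for i, n in enumerate(order)}
    last_use = {n: max((pos[c] for c in consumers[n]), default=pos[n])
                for n in deps}
    peak, live = 0, set()
    for i, n in enumerate(order):
        live.add(n)
        peak = max(peak, len(live))
        live = {m for m in live if last_use[m] > i}  # retire dead values
    return peak

# Exhaustively pick the schedule with the smallest peak liveness.
best = min((o for o in permutations(deps) if is_topological(o)),
           key=peak_live)
print(best, peak_live(best))
```

On this toy graph the best schedules reach a peak of 3 live values (computing `d` forces `a`, `b`, and `d` to coexist), while a careless order can hit 4; a real framework would of course use heuristics or cost models rather than enumeration to reorder a deep recurrence graph.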