Node-Level Performance and Energy Characterization of Flagship Science Applications on SuperMUC-NG Phase 2
2026-06-22 • Distributed, Parallel, and Cluster Computing
Distributed, Parallel, and Cluster Computing • Performance
AI summary
The authors tested how well five scientific programs run on a powerful computer with both CPUs and GPUs. They checked how fast and energy-efficient these programs were when using only CPUs versus using both CPUs and GPUs. The results showed that using GPUs made the programs run 4 to 12 times faster and use up to 15 times less energy, especially for some codes like lammps and AthenaK. However, if the GPU didn't have enough work to do at once, the benefits decreased. They also found that CPUs often didn't use all their available power during these tests.
Keywords: supercomputer, CPU, GPU, energy efficiency, throughput, molecular dynamics, finite-element method, Intel Xeon Platinum, Ponte Vecchio GPU, energy-aware runtime
Authors
Salvatore Cielo, Elmira Birang, Alexander Pöppl, Sajad Azizi, Plamen Dobrev, Margarita Egelhofer, Ivan Pribec, Gerald Mathias
Abstract
We present a systematic performance and energy-efficiency characterization of five flagship scientific workloads on SuperMUC-NG phase 2, the 28 PetaFLOPs system at the Leibniz Supercomputing Center (LRZ), equipped with Intel Xeon Platinum 8480+ CPUs (Sapphire Rapids, SPR) and Intel Data Center GPU Max 1550 (Ponte Vecchio, PVC) accelerators. The selected codes span molecular dynamics (gromacs, lammps), astrophysics and cosmology (OpenGadget3, AthenaK), and finite-element PDE solvers from the dealii-X Center of Excellence. For each code we measure, on a single compute node, throughput expressed as compute elements per wall-clock second and energy efficiency expressed as compute elements per joule of consumed energy, comparing CPU-only (SPR) against combined CPU+GPU (SPR+PVC) configurations where available. Energy measurements rely on lightweight code instrumentation with p3em, or on the Energy Aware Runtime (EAR) present on the system. Our results show that GPU offload yields $4\times$ to $12\times$ higher throughput and up to $15\times$ better energy efficiency compared to CPU-only execution, with lammps and AthenaK benefiting most. However, both throughput and energy gains are sensitive to problem granularity: insufficient work per GPU tile erodes the accelerator advantage, as clearly observed in AthenaK at small mesh-block sizes. Power-budget utilization is systematically lower for CPUs than for GPUs, indicating that even at their peak useful-work rate, most applications running on CPUs leave a significant fraction of the node's thermal envelope unused.
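The two metrics named in the abstract, and the gains derived from them, reduce to simple ratios. The sketch below shows one way to compute them from a single-node run. It is illustrative only: the `NodeRun` class, the helper names, and all numeric values are hypothetical and are not taken from the paper, and defining power-budget utilization as mean power divided by a nominal power budget is an assumption about what the authors mean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeRun:
    """One single-node measurement (hypothetical container, not from the paper)."""
    elements: float        # compute elements processed (e.g. particle or cell updates)
    wall_seconds: float    # wall-clock time of the measured region
    energy_joules: float   # energy consumed over the same region

    @property
    def throughput(self) -> float:
        """Compute elements per wall-clock second."""
        return self.elements / self.wall_seconds

    @property
    def energy_efficiency(self) -> float:
        """Compute elements per joule of consumed energy."""
        return self.elements / self.energy_joules

    @property
    def mean_power_watts(self) -> float:
        """Average power draw over the measured region."""
        return self.energy_joules / self.wall_seconds


def throughput_gain(accelerated: NodeRun, baseline: NodeRun) -> float:
    """Ratio of accelerated to baseline throughput (values > 1 mean faster)."""
    return accelerated.throughput / baseline.throughput


def efficiency_gain(accelerated: NodeRun, baseline: NodeRun) -> float:
    """Ratio of accelerated to baseline energy efficiency."""
    return accelerated.energy_efficiency / baseline.energy_efficiency


def power_budget_utilization(run: NodeRun, power_budget_watts: float) -> float:
    """Assumed definition: mean power as a fraction of a nominal power budget."""
    return run.mean_power_watts / power_budget_watts


if __name__ == "__main__":
    # Made-up numbers chosen only to exercise the formulas.
    cpu_only = NodeRun(elements=1.0e9, wall_seconds=100.0, energy_joules=60_000.0)
    cpu_gpu = NodeRun(elements=1.0e9, wall_seconds=20.0, energy_joules=30_000.0)

    print(f"throughput gain: {throughput_gain(cpu_gpu, cpu_only):.2f}x")
    print(f"efficiency gain: {efficiency_gain(cpu_gpu, cpu_only):.2f}x")
    print(f"CPU-only utilization of a 1000 W budget: "
          f"{power_budget_utilization(cpu_only, 1000.0):.0%}")
```

Because both runs process the same number of elements, the throughput gain equals the ratio of wall-clock times and the efficiency gain equals the ratio of consumed energies. The two gains can differ because the accelerated configuration typically draws more power while it runs.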