Taming GPU Underutilization via Static Partitioning and Fine-grained CPU Offloading

2026-04-09 | Distributed, Parallel, and Cluster Computing

AI summary

The authors study how to share GPUs more effectively among scientific, AI, and data analytics applications. They examine Multi-Instance GPU (MIG), which splits a GPU into fixed-size, isolated slices of compute and memory. They find that MIG reduces resource underutilization, but that co-located jobs still interfere through shared resources, for example through power throttling. Because fixed slices often do not match what an application needs, the authors propose offloading memory over the cache-coherent NVLink-C2C interconnect.

Keywords: GPU, Multi-Instance GPU (MIG), high-performance computing (HPC), resource utilization, NVLink, cache-coherent interconnect, memory offloading, power throttling, AI workloads
Authors
Gabin Schieffer, Ruimin Shi, Jie Ren, Ivy Peng
Abstract
Advances in GPU compute throughput and memory capacity bring significant opportunities to a wide range of workloads. However, efficiently utilizing these resources remains challenging, particularly because diverse application characteristics may result in imbalanced utilization. Multi-Instance GPU (MIG) is a promising approach to improve utilization by partitioning GPU compute and memory resources into fixed-size slices with isolation. Yet its effectiveness and limitations in supporting HPC workloads remain an open question. We present a comprehensive system-level characterization of different GPU sharing options using real-world scientific, AI, and data analytics applications, including NekRS, LAMMPS, Llama3, and Qiskit. Our analysis reveals that while GPU sharing via MIG can significantly reduce resource underutilization and enable system-level improvements in throughput and energy, interference still occurs through shared resources, for example in the form of power throttling. Our performance-resource scaling results indicate that coarse-grained provisioning of tightly coupled compute and memory resources often mismatches application needs. To address this mismatch, we propose a memory-offloading scheme that leverages the cache-coherent NVLink-C2C interconnect to bridge the gap between coarse-grained resource slices and reduce resource underutilization.
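
The mismatch between coarse-grained slices and application needs can be illustrated with a small sketch. It is not the authors' scheme: the slice menu, the `Slice` type, and the `plan` function are hypothetical, and the sizes are illustrative numbers rather than values from the paper or a specific GPU. The sketch only shows why letting memory spill to the CPU can allow a job to fit in a smaller slice. It ignores the performance cost of accessing offloaded memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slice:
    """A hypothetical fixed-size GPU partition (illustrative numbers only)."""
    name: str
    compute_units: int  # share of GPU compute, in arbitrary units
    memory_gb: int      # GPU memory bundled with this slice


# Assumed slice menu, ordered from smallest to largest.
SLICES = [
    Slice("small", 1, 12),
    Slice("medium", 3, 48),
    Slice("full", 7, 96),
]


def plan(compute_needed, memory_needed_gb, allow_offload):
    """Pick the smallest slice for a job and report offloaded memory.

    Without offloading, a slice must cover both compute and memory.
    With offloading, a slice only has to cover compute; memory beyond
    the slice's capacity is assumed to be placed in CPU memory.
    Returns a (slice, offloaded_gb) tuple.
    """
    for s in SLICES:
        if s.compute_units < compute_needed:
            continue  # not enough compute in this slice
        if s.memory_gb >= memory_needed_gb:
            return s, 0  # everything fits in GPU memory
        if allow_offload:
            return s, memory_needed_gb - s.memory_gb
    raise ValueError("no slice fits this job")


# A job that needs little compute but 40 GB of memory.
chosen, offloaded = plan(1, 40, allow_offload=False)
print(chosen.name, offloaded)

chosen, offloaded = plan(1, 40, allow_offload=True)
print(chosen.name, offloaded)
```

In this toy setting, the memory-heavy job must take a larger slice when offloading is disabled, which leaves most of that slice's compute idle. With offloading enabled, it fits in the smallest slice and the remaining memory is counted as offloaded.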