Taming GPU Underutilization via Static Partitioning and Fine-grained CPU Offloading

2026-04-09 • Distributed, Parallel, and Cluster Computing

AI summary

The authors studied how to better share powerful GPUs among different scientific and AI programs. They looked at a method called Multi-Instance GPU (MIG), which splits a GPU into fixed parts for different tasks. Their research found that while MIG helps use GPUs more efficiently, some issues like shared power limits still cause problems. To improve this, the authors suggest a way to offload memory using a fast interconnect to better match how programs need resources.

GPU, Multi-Instance GPU (MIG), high-performance computing (HPC), resource utilization, NVLink, cache-coherent interconnect, memory offloading, power throttling, AI workloads

Authors

Gabin Schieffer, Ruimin Shi, Jie Ren, Ivy Peng

Abstract

Advances in GPU compute throughput and memory capacity bring significant opportunities to a wide range of workloads. However, efficiently utilizing these resources remains challenging, particularly because diverse application characteristics may result in imbalanced utilization. Multi-Instance GPU (MIG) is a promising approach to improve utilization by partitioning GPU compute and memory resources into fixed-size slices with isolation. Yet, its effectiveness and limitations in supporting HPC workloads remain an open question. We present a comprehensive system-level characterization of different GPU sharing options using real-world scientific, AI, and data analytics applications, including NekRS, LAMMPS, Llama3, and Qiskit. Our analysis reveals that while GPU sharing via MIG can significantly reduce resource underutilization and enable system-level improvements in throughput and energy, interference still occurs through shared resources, for example in the form of power throttling. Our performance-resource scaling results indicate that coarse-grained provisioning for tightly coupled compute and memory resources often mismatches application needs. To address this mismatch, we propose a memory-offloading scheme that leverages the cache-coherent NVLink-C2C interconnect to bridge the gap between coarse-grained resource slices and reduce resource underutilization.
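
The mismatch between coarse-grained slices and application needs can be illustrated with a small sketch. This is not the authors' scheme or any NVIDIA API: the `Slice` type, the `SLICES` table, and both helper functions are hypothetical, and the slice sizes are illustrative numbers rather than values from the paper. The sketch only shows why coupling compute and memory in one slice can strand resources, and how spilling the memory excess to CPU memory could allow a smaller slice.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Slice:
    """A hypothetical fixed-size GPU slice that bundles compute and memory."""
    compute_units: int
    memory_gb: int


# Illustrative slice table (assumed values, not taken from the paper).
SLICES = [Slice(1, 12), Slice(2, 24), Slice(3, 48), Slice(4, 48), Slice(7, 96)]


def smallest_fitting_slice(need_compute: int, need_mem_gb: int) -> Optional[Slice]:
    """Pick the smallest slice that satisfies BOTH compute and memory needs."""
    candidates = [
        s for s in SLICES
        if s.compute_units >= need_compute and s.memory_gb >= need_mem_gb
    ]
    return min(candidates, key=lambda s: (s.compute_units, s.memory_gb), default=None)


def plan_with_offload(need_compute: int, need_mem_gb: int) -> Optional[Tuple[Slice, int]]:
    """Pick the smallest slice by compute only; spill excess memory to the CPU.

    Returns the chosen slice and the number of GB that would live in CPU memory.
    """
    candidates = [s for s in SLICES if s.compute_units >= need_compute]
    if not candidates:
        return None
    chosen = min(candidates, key=lambda s: (s.compute_units, s.memory_gb))
    spill_gb = max(0, need_mem_gb - chosen.memory_gb)
    return chosen, spill_gb


if __name__ == "__main__":
    # A memory-heavy, compute-light job: 1 compute unit, 40 GB of memory.
    coupled = smallest_fitting_slice(1, 40)
    offloaded = plan_with_offload(1, 40)
    print("coupled provisioning:", coupled)
    print("with offloading:", offloaded)
```

In this toy setting, the memory-heavy job is forced onto a three-unit slice under coupled provisioning, leaving two compute units idle, whereas the offloading plan keeps it on a one-unit slice and places the remaining 28 GB in CPU memory. Whether that trade pays off in practice depends on the interconnect bandwidth and the application's access pattern, which the sketch does not model.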
