Scalable Concurrent Queues for GPU

2026-06-01Distributed, Parallel, and Cluster Computing

Distributed, Parallel, and Cluster ComputingData Structures and Algorithms
AI summary

The authors study concurrent queues, which help manage tasks efficiently in supercomputers, especially when there are millions of processor cores. They focus on implementing these queues on GPUs, which have different hardware challenges than CPUs due to their parallel setup. The authors present three new GPU queue designs that provide different guarantees on waiting times and locking, aiming to improve speed and reliability. These designs handle memory and processing constraints while ensuring correct and fast task handling.

concurrent queuesupercomputingGPUlock-freewait-freelinearizabilitySIMTatomic operationsparallelismcompare-and-swap
Authors
Pratheek Prakash Shetty, Thomas R. W. Scogland, Wu-chun Feng
Abstract
Concurrent queues can significantly impact supercomputing performance by being critical bottlenecks for task distribution, load balancing, and resource utilization. As HPC systems move beyond 10-million processor cores, the ability to rapidly move items between producer and consumer threads without excessive locking is essential for efficient queues, preventing idle cores, maximizing utilization, and achieving high parallel speedup. While concurrent queues are well studied on CPUs, they remain largely unexplored on modern GPUs, where SIMT execution, massive parallelism, and atomic contention reshape the design space. We present three linearizable GPU concurrent queues spanning from lock-free to wait-free guarantees: (1) G-WFQ-YMC, an adaptation of Yang and Mellor-Crummey's wait-free queue using preallocated segments; (2) G-LFQ, a bounded lock-free queue that uses wave-batched fast paths to maximize throughput; and (3) G-WFQ, a bounded wait-free queue that packs shared state into 64-bit compare-and-swap operations while preserving linearizability and bounded memory.