Co-Designing Graph-based Approximate Nearest Neighbor Search at Billion Scale for Processing-in-Memory
2026-05-25 • Hardware Architecture
Hardware Architecture
AI summaryⓘ
The authors looked at speeding up a way computers find items similar to a given one (Approximate Nearest Neighbor Search) by using special memory chips that can do some computing themselves (Processing-in-Memory or PIM). They fixed problems like limited local memory and weak computing power by redesigning the data layout, improving scheduling, and simplifying calculations. Their new design works much faster than regular CPUs and GPUs, and better than older PIM methods while keeping accuracy very high. It also works well when spreading tasks across multiple machines.
Approximate Nearest Neighbor SearchGraph-based ANNSProcessing-in-Memory (PIM)Memory-bound workloadGraph traversalIndex layoutAsynchronous pipelined schedulerDistance kernelCPU throughputHigh-recall regime
Authors
Sitian Chen, Yusen Li, Yao Chen, Minwen Deng, Jintao Meng, Amelie Chi Zhou
Abstract
Approximate Nearest Neighbor Search (ANNS) is a core primitive in modern AI systems, and graph-based methods currently offer the best accuracy-efficiency trade-off at scale. The workload is fundamentally memory-bound: graph traversal produces frequent, irregular memory accesses that cap CPU throughput at main-memory bandwidth, while GPUs lack the high-bandwidth memory capacity to host billion-scale indexes. Processing-in-Memory (PIM) is a natural candidate, as placing computation next to data unlocks the abundant internal bandwidth that such bandwidth-starved workloads demand. Porting graph-based ANNS to PIM, however, exposes several architectural mismatches: each processing unit has only a small local memory, inter-unit communication is costly, host coordination adds overhead, and in-memory compute units are relatively weak -- limitations that have forced prior PIM-based ANNS designs to fall back on cluster-based indexing, whose recall ceiling is far below that of graph methods. This paper presents an algorithm-architecture co-design that overcomes these obstacles through three components: a compacted index layout that shrinks the PIM-resident memory footprint by 14.5x; an asynchronous pipelined scheduler that keeps the host-to-PIM interconnect saturated; and a multiplication-free distance kernel that loses under 0.08% recall. Across three billion-scale benchmarks, the proposed design achieves up to 20x and 17.1x higher throughput than CPU and GPU baselines, respectively, outperforms prior PIM accelerators by 129x in the high-recall regime, and scales gracefully across multi-node deployments and emerging PIM architecture.