Clutch: High Performance Vector-Scalar Comparison using DRAM via Chunked Temporal Coding

2026-06-22 • Hardware Architecture

Hardware ArchitectureDistributed, Parallel, and Cluster Computing

AI summaryⓘ

The authors focus on speeding up a common task where each item in a list is compared to a single number, which is important in data-heavy fields like databases and machine learning. They identify that current methods using DRAM for processing data directly have delays because the comparison process gets slower as numbers get bigger. Their solution, called Clutch, changes how data is represented and compared inside the memory to make this process much faster and more efficient. By breaking data into chunks and comparing these parts smartly, their method boosts speed and saves energy compared to both traditional CPUs/GPUs and previous DRAM-based approaches. They also show how their method can be applied to decision tree tasks, opening up new uses for this technology.

vector-scalar comparisonprocessing-using-DRAM (PuD)bit-serial executiontemporal codinglookup tablepredicate evaluationdecision tree inferencememory-bound computationenergy efficiencychunked data representation

Authors

Daichi Tokuda, Tatsuya Kubo, Ismail Emir Yuksel, Ataberk Olgun, Haocong Luo, Tomoya Nagatani, Geraldo F. Oliveira, Abdullah Giray Yağlıkçı, Mohammad Sadrosadati, Onur Mutlu, Shinya Takamaeda-Yamazaki

Abstract

Vector-scalar comparison is a fundamental computation primitive that compares each element in a vector against a single scalar value. It is widely used in various data-intensive workloads from databases to machine learning. Due to its low computational intensity, its execution tends to be memory-bound, limiting the utilization of compute resources. Processing-using-DRAM (PuD) is an emerging computing paradigm that performs massively parallel bitwise operations directly inside DRAM arrays, alleviating off-chip data movement. Existing PuD-based approaches require many DRAM commands because the comparison's algorithmic complexity grows with operand bit-width in the bit-serial execution model. This command overhead becomes the dominant bottleneck, limiting application-level speedup. We propose Clutch, a data representation and comparison algorithm that accelerates vector-scalar comparisons in PuD systems with high efficiency and scalability. Clutch first uses temporal coding, encoding each vector value as a sequence of leading ones, which enables lookup-based comparison against a scalar by accessing the corresponding DRAM row. To avoid the prohibitive memory footprint of lookup tables at high precision, Clutch partitions operands into multiple multi-bit chunks, compares chunks independently using compact lookup tables, and merges the per-chunk results with a PuD-efficient procedure. By adjusting the number of chunks, Clutch provides a flexible tradeoff between throughput and memory usage. Across predicate evaluation and decision tree inference, Clutch improves end-to-end application throughput and energy efficiency by an average of 12x and 69x over highly optimized CPU and GPU execution, and by 2.9x and 3.0x over the state-of-the-art bit-serial PuD implementation. We also present the first mapping of decision tree inference to PuD execution, extending PuD to a new application domain.

View PDFOpen arXiv