PALUTE: Processing-In-Memory Acceleration via Lookup Table for Edge LLM Inference

2026-06-08, Hardware Architecture

Hardware Architecture, Emerging Technologies
AI summary

The authors address the challenge of running large language models efficiently on small, low-power devices. They designed PALUTE, a processing-in-memory accelerator that stores precomputed results in lookup tables inside a 3D DRAM chip, replacing repeated arithmetic with table lookups. Their design improves energy efficiency and area efficiency compared to previous designs. PALUTE handles both the matrix-multiplication (GEMM) steps and the nonlinear functions that language models need.

Large Language Models, Edge Devices, Quantized Inference, Lookup Tables (LUT), 3D DRAM, Processing-In-Memory, GEMM, Nonlinear Operators, Energy Efficiency, RTL Synthesis
Authors
Runyang Tian, Yanru Chen, Weihong Xu, Tajana Šimunić Rosing
Abstract
Large language models are increasingly deployed on edge devices with tight power and area budgets. While mixed-precision GEMM reduces arithmetic complexity, quantized inference is often dominated by dequantization and nonlinear operators. Lookup table (LUT)-based methods mitigate these costs by precomputing outputs and replacing repeated arithmetic with table lookups, but existing designs incur significant capacity and lookup-latency overheads. This paper presents PALUTE, a LUT-based Processing-In-Memory accelerator built on monolithic 3D (M3D) DRAM for efficient edge LLM inference. PALUTE enables in-DRAM LUT queries that exploit the vertical organization of M3D DRAM memory array tiles to achieve high parallelism with low area overhead. A near-memory LUT generator supports low-latency LUT generation for both GEMM and element-wise unary nonlinear operators, while a system-level tiering and scheduling strategy minimizes data movement across memory tiers. Evaluation using cycle-accurate simulation and RTL synthesis shows that PALUTE achieves 1,264 TPS end-to-end throughput at 0.16 W, improving energy efficiency by 12.8$\times$ over CHIME and 1.6$\times$ over FIGLUT, and improving area efficiency by 2.0$\times$ over PIMPAL, under W4A4 across Qwen3-4B models.
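To make the general LUT idea in the abstract concrete, the following is a minimal software sketch of replacing arithmetic with table lookups under W4A4 quantization. It is not PALUTE's hardware design or the authors' algorithm: the signed two's-complement 4-bit encoding, the activation scale of 0.5, the choice of SiLU as the unary operator, and all function names (`to_signed`, `lut_dot`, `lut_matvec`, `build_unary_lut`, `lut_apply`) are assumptions made purely for illustration.

```python
import math

BITS = 4
LEVELS = 1 << BITS  # 16 possible codes for a 4-bit operand


def to_signed(code):
    """Interpret a 4-bit code as a two's-complement integer in [-8, 7] (assumed encoding)."""
    return code - LEVELS if code >= LEVELS // 2 else code


# GEMM side: precompute every weight x activation product once (16 x 16 entries).
MUL_LUT = [[to_signed(w) * to_signed(a) for a in range(LEVELS)]
           for w in range(LEVELS)]


def lut_dot(w_codes, a_codes):
    """Dot product where each multiply is replaced by a table lookup."""
    return sum(MUL_LUT[w][a] for w, a in zip(w_codes, a_codes))


def lut_matvec(weight_rows, a_codes):
    """Matrix-vector product built from LUT-based dot products."""
    return [lut_dot(row, a_codes) for row in weight_rows]


# Nonlinear side: an element-wise unary function becomes a 16-entry table.
def silu(x):
    return x / (1.0 + math.exp(-x))


def build_unary_lut(fn, scale):
    """Tabulate fn over all dequantized 4-bit inputs (scale is an assumed parameter)."""
    return [fn(to_signed(code) * scale) for code in range(LEVELS)]


def lut_apply(lut, codes):
    """Apply the tabulated function by indexing, with no arithmetic per element."""
    return [lut[c] for c in codes]


SILU_LUT = build_unary_lut(silu, scale=0.5)

if __name__ == "__main__":
    weights = [[1, 15, 8], [2, 2, 2]]   # 4-bit codes; 15 -> -1, 8 -> -8
    acts = [2, 3, 7]
    print(lut_matvec(weights, acts))
    print(lut_apply(SILU_LUT, [0, 2, 14]))
```

In this sketch the product table needs only 256 entries because both operands are 4 bits wide, and the unary table folds dequantization and the nonlinear function into a single lookup. Table size grows exponentially with operand width, which is one way to see the capacity overhead that the abstract attributes to existing LUT designs.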