O-POPE: High-Frequency Pipelined Outer Product based GEMM acceleration with minimal buffering overhead

2026-06-01 • Hardware Architecture

AI summary

The authors present O-POPE, an outer-product engine for accelerating the floating-point matrix multiplication (GEMM) that dominates machine learning workloads. Instead of adding dedicated buffers, the design reuses the pipeline registers of its floating-point units as buffers, which keeps utilization high without a large area cost. A 2048-MAC configuration runs at 1 GHz (0.72 V) in 12 nm FinFET technology with less than 2% of its area spent on buffering. In the authors' evaluation, O-POPE reaches up to 99.97% FPU utilization and improves performance by 1.33x, performance density by 9%, and energy efficiency by 8% over state-of-the-art floating-point GEMM accelerators.

General Matrix Multiply (GEMM), floating-point unit (FPU), outer-product execution, quantization, pipeline registers, machine learning acceleration, FinFET technology, performance density, energy efficiency

Authors

Danilo Cammarata, Angelo Garofalo, Luca Benini

Abstract

General matrix multiply (GEMM) dominates both the execution time and the energy consumption of modern machine learning (ML) workloads, placing increasing pressure on hardware efficiency. While quantization mitigates computational and data-movement costs, accuracy-sensitive tasks such as training still require higher-precision floating-point formats. Existing floating-point GEMM accelerators face trade-offs between operating frequency, arithmetic utilization, and buffering overhead. This work presents O-POPE, a scalable outer-product engine that simultaneously achieves high utilization, low overhead, and a fast operating frequency by repurposing floating-point unit (FPU) pipeline registers as buffers. This solution leverages the data-reuse advantages of output-stationary outer-product execution and enables 1 GHz (0.72 V) operation in 12 nm FinFET technology with less than 2% buffer area for a 2048-MAC configuration. Our evaluation shows that O-POPE achieves up to 99.97% FPU utilization and improves performance by 1.33x, performance density by 9%, and energy efficiency by 8% compared to state-of-the-art floating-point GEMM accelerators.
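
The output-stationary outer-product execution mentioned in the abstract can be illustrated with a minimal software sketch. This is not the authors' hardware design or code: the function name `outer_product_gemm` and the operand-load counter are hypothetical, introduced only to show why the dataflow reuses data well. Each step k takes one column of A and one row of B (M + N operands) and performs M x N multiply-accumulates, while every element of C stays in its own accumulator for the whole computation.

```python
def outer_product_gemm(a, b):
    """Compute C = A @ B as a sum of K rank-1 (outer-product) updates.

    Illustrative model only: `a` is an M x K list of lists, `b` is K x N.
    Returns (c, macs, operand_loads).
    """
    m, k_dim, n = len(a), len(b), len(b[0])
    assert all(len(row) == k_dim for row in a), "inner dimensions must match"

    # Output-stationary: one accumulator per element of C, kept in place
    # across all K steps and written out only at the end.
    c = [[0.0] * n for _ in range(m)]
    operand_loads = 0

    for k in range(k_dim):
        a_col = [a[i][k] for i in range(m)]  # M operands fetched this step
        b_row = b[k]                         # N operands fetched this step
        operand_loads += m + n

        # Rank-1 update: each a_col[i] is reused N times and each b_row[j]
        # is reused M times, giving M * N MACs from M + N operands.
        for i in range(m):
            for j in range(n):
                c[i][j] += a_col[i] * b_row[j]

    macs = m * n * k_dim
    return c, macs, operand_loads


if __name__ == "__main__":
    c, macs, loads = outer_product_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(c, macs, loads)
```

In this model the ratio of MACs to fetched operands is M * N / (M + N) per step, so it grows with the array dimensions; that is the data-reuse property the abstract attributes to the output-stationary outer-product dataflow.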
