AIA: A 16nm Multicore SoC for Approximate Inference Acceleration Exploiting Non-normalized Knuth-Yao Sampling and Inter-Core Register Sharing

2026-06-15 • Hardware Architecture

AI summary

The authors designed a special computer chip named AIA to help machines make decisions faster and more efficiently using probabilistic models. The chip uses 16 small processors working together, each with new hardware to speed up the math needed for sampling-based reasoning methods such as Markov Chain Monte Carlo (MCMC). They also created compiler software to organize and schedule tasks across the chip. The design is up to 2× faster and 1.45× more energy-efficient than the previous best chip built for a similar job. They also ran a different type of probabilistic model on the chip to show its flexibility.

Probabilistic Graphical Models • Markov Chain Monte Carlo (MCMC) • Approximate Inference • RISC-V • Knuth-Yao Sampler • 2D Mesh Architecture • Bayesian Networks • Markov Random Field (MRF) • Compiler Optimization • Edge Computing

Authors

Shirui Zhao, Nimish Shah, Wannes Meert, Marian Verhelst

Abstract

Probabilistic graphical models (PMs) are widely used to give machine learning systems the ability to reason and make decisions. Approximate inference in PMs commonly relies on sampling-based Markov Chain Monte Carlo (MCMC) algorithms. Unfortunately, MCMC is compute-intensive and hard to parallelize, resulting in inefficient execution on modern CPU/GPU platforms. This paper proposes AIA, an Approximate Inference Accelerator designed to enable decision-making and reasoning at the edge. AIA consists of a RISC-V host and a 2D mesh of 16 customized RISC-V cores optimized for PM inference, each featuring (i) a novel non-normalized Knuth-Yao sampler and interpolation unit; and (ii) core-to-core direct data access via the register file. These features target the compute-intensive operations of inference. To fully exploit the parallel potential of MCMC algorithms, a customized compiler chain has been developed for effective spatial mapping and scheduling on the chip. AIA generates 1277 MSamples/s at 0.9 V and 20 GSamples/s/W at 0.7 V, which is up to 2× faster and 1.45× more energy-efficient than the previous state-of-the-art Markov Random Field (MRF) accelerator. We further demonstrate the flexibility of the design by mapping Bayesian Network benchmarks onto AIA.
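For readers unfamiliar with Knuth-Yao sampling, the following is a minimal Python sketch of the textbook (normalized) algorithm, which draws from a discrete distribution by consuming one random bit per level of a binary tree built from the probabilities' binary expansions. It is not the non-normalized variant implemented in AIA, whose details the abstract does not give. The function names `bit_matrix` and `knuth_yao_sample`, and the restriction to integer weights summing to a power of two, are assumptions made for illustration.

```python
import random


def bit_matrix(weights, k):
    """Row i holds the k-bit binary expansion (MSB first) of weights[i] / 2**k.

    Assumption for this sketch: integer weights, each below 2**k, that sum to
    exactly 2**k, so the distribution is already normalized.
    """
    if sum(weights) != 1 << k or any(not 0 <= w < (1 << k) for w in weights):
        raise ValueError("weights must lie in [0, 2**k) and sum to 2**k")
    return [[(w >> (k - 1 - col)) & 1 for col in range(k)] for w in weights]


def knuth_yao_sample(matrix, next_bit):
    """Walk the implicit tree one level per random bit and return an outcome index."""
    d = 0  # position among the nodes at the current tree level
    for col in range(len(matrix[0])):
        d = 2 * d + next_bit()  # descend to the left or right child
        # Scan this column from the last row up; each set bit is a leaf here.
        for row in range(len(matrix) - 1, -1, -1):
            d -= matrix[row][col]
            if d < 0:
                return row  # the walk landed on the leaf for this outcome
    raise RuntimeError("unreachable when the weights sum to 2**k")


# Usage: outcome probabilities 5/8, 2/8 and 1/8.
matrix = bit_matrix([5, 2, 1], 3)
print(knuth_yao_sample(matrix, lambda: random.getrandbits(1)))
```

Passing the bit source in as a callable keeps the sampler deterministic under test and mirrors the fact that the walk needs only single random bits, comparisons, and subtractions, which is one reason the method suits hardware.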
