MPX: A Unified Systolic Array for Matrix and Polynomial Multiplication

2026-06-15Cryptography and Security

Cryptography and SecurityHardware Architecture
AI summary

The authors focus on improving how computers multiply polynomials, which is important for modern encryption methods. Instead of using special hardware designed only for encryption, they designed a system called MPX that can do both regular matrix multiplication and polynomial multiplication using the same setup. This design only slightly increases the hardware size and power use, but it makes polynomial multiplication faster than previous methods. Their approach cleverly matches the natural data flow of systolic arrays to speed up polynomial calculations.

Polynomial multiplicationFully Homomorphic EncryptionPost-quantum cryptographyNumber Theoretic TransformSystolic arrayMatrix multiplicationCryptographic acceleratorHardware designLatencyPower overhead
Authors
George Alexakis, Dimitrios Schoinianakis, Giorgos Dimitrakopoulos
Abstract
Polynomial multiplication is a fundamental kernel in Fully Homomorphic Encryption (FHE) and post-quantum cryptography (PQC) and is commonly accelerated through Number Theoretic Transforms (NTTs). To avoid the cost of designing dedicated cryptographic accelerators, recent efforts have mapped NTT computations onto existing systolic matrix engines, enabling the reuse of AI hardware for cryptographic workloads. In this work, we take the opposite approach. We observe that the wavefront dataflow of systolic arrays naturally aligns with the accumulation pattern of polynomial multiplication and leverage this correspondence to design MPX, a dual-mode systolic array that supports both matrix multiplication and direct polynomial multiplication within the same hardware fabric. Experimental results show that extending a conventional systolic array with this dual-mode capability requires only 20% additional area and introduces negligible power overhead during matrix-multiplication execution. In polynomial-multiplication mode, MPX achieves more than 1.2x lower latency compared to NTT-based polynomial multiplication on systolic matrix engines.