Architectural Limits of Cloud TPUs in Finite-Field Cryptography

2026-05-25 • Hardware Architecture

AI summary

The authors study why cloud Tensor Processing Units (TPUs), which are designed for AI workloads, are far less cost-efficient than GPUs at the exact arithmetic that finite-field cryptography requires. Against A100 GPU baselines they measure deficits of roughly 4,700× to 6,900× on v4 and v5p TPUs. They attribute the gap to two sources: an arithmetic penalty from the lack of wide-integer ALUs, and a spatial penalty from how the systolic arrays are occupied. Strict separation between concurrent tenants forces eager Montgomery reduction and yields a projected 5.19× spatial collapse; relaxing that constraint recovers those cycles in theory, but the arithmetic penalty remains. Their measurement tool packs polynomials into dense matrices for matrix-form Number Theoretic Transforms and fills the K dimension almost completely, yet the M dimension stays under-utilised and VPU-bound Montgomery reduction stalls starve the arrays. Overall, the work highlights fundamental hardware limitations of TPUs for exact finite-field arithmetic.

Tensor Processing Units (TPUs), GPUs, finite-field cryptography, Montgomery reduction, Number Theoretic Transform, systolic arrays, FP32, int32 accumulator, multi-tenant deployment, XLA fusion engine

Authors

Hung Dang, Xuan Phu Dang, Tue Nguyen

Abstract

We empirically characterise the cost-efficiency deficit between cloud Tensor Processing Units and GPUs for finite-field cryptography. Against A100 GPU baselines (cuZK), we measure a $[5{,}558\times, 6{,}908\times]$ deficit across v5p and v4 architectures under an FP32-mantissa staging discipline, and a $\sim 4{,}693\times$ deficit using v5p's native \texttt{int32} accumulator. We analytically project this deficit into a fundamental arithmetic penalty (lacking wide-integer ALUs) and a spatial penalty. We demonstrate that evaluating concurrent multi-tenant deployments, where strict separation forces eager Montgomery reduction, yields a projected $5.19\times$ spatial collapse; relaxing this constraint theoretically recovers these spatial cycles, yet the underlying arithmetic penalty remains. To facilitate this characterisation, we deploy \codename as a measurement vehicle. By mapping low-degree polynomials onto matrix-form Number Theoretic Transforms, the scheduler stacks heterogeneous polynomials into dense 2D matrices, achieving $\sim 100\%$ K-dimension column occupancy on uniform workloads ($> 92\%$ on mixed-degree traces). However, despite optimal K-dimension packing, severe M-dimension under-utilisation (e.g., $6.25\%$ on v4) combined with overwhelming VPU-bound Montgomery reduction stalls mathematically starve the systolic arrays. A post-hoc HLO validator ensures these measurements remain structurally isolated against the XLA fusion engine. Our findings empirically demonstrate the structural inadequacy of AI-optimised systolic arrays for exact, high-throughput field arithmetic.
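
To make the eager-versus-relaxed Montgomery reduction distinction concrete, the following is a minimal Python sketch of a matrix-form Number Theoretic Transform over a toy prime field. It is not the authors' implementation and does not model TPU hardware. The modulus, transform length, root of unity, Montgomery radix, and all function names are assumptions chosen for illustration. The eager variant performs one Montgomery reduction per multiply, while the lazy variant accumulates a plain integer dot product (the part that resembles a matrix multiply) and reduces once per output.

```python
# Toy parameters: assumptions for illustration, not values from the paper.
P = 17                  # small prime modulus
N = 4                   # transform length; N divides P - 1
OMEGA = 4               # primitive N-th root of unity mod P (4^2 = -1, 4^4 = 1)
R_BITS = 8
R = 1 << R_BITS         # Montgomery radix, coprime to P and larger than P
NEG_P_INV = (-pow(P, -1, R)) % R   # -P^{-1} mod R (needs Python 3.8+)


def redc(t):
    """Montgomery reduction: return t * R^{-1} mod P for 0 <= t < P * R."""
    m = ((t & (R - 1)) * NEG_P_INV) & (R - 1)
    u = (t + m * P) >> R_BITS      # exact division: t + m*P is a multiple of R
    return u - P if u >= P else u


def to_mont(a):
    """Map a into Montgomery form a * R mod P."""
    return (a * R) % P


def from_mont(a):
    """Map a Montgomery-form value back to the ordinary residue."""
    return redc(a)


# Twiddle matrix W[i][j] = OMEGA^(i*j) mod P, stored in Montgomery form.
W_MONT = [[to_mont(pow(OMEGA, i * j, P)) for j in range(N)] for i in range(N)]


def ntt_eager(coeffs):
    """Matrix-form NTT with one Montgomery reduction per product."""
    x = [to_mont(c) for c in coeffs]
    out = []
    for row in W_MONT:
        acc = 0
        for w, v in zip(row, x):
            acc = (acc + redc(w * v)) % P   # reduce immediately after each multiply
        out.append(from_mont(acc))
    return out


def ntt_lazy(coeffs):
    """Matrix-form NTT with a single Montgomery reduction per output."""
    x = [to_mont(c) for c in coeffs]
    out = []
    for row in W_MONT:
        acc = sum(w * v for w, v in zip(row, x))   # unreduced integer dot product
        assert acc < P * R, "accumulator exceeds the REDC input bound"
        out.append(from_mont(redc(acc)))
    return out


def ntt_reference(coeffs):
    """Direct evaluation with ordinary modular arithmetic, for comparison."""
    return [
        sum(c * pow(OMEGA, i * j, P) for j, c in enumerate(coeffs)) % P
        for i in range(N)
    ]


if __name__ == "__main__":
    poly = [1, 2, 3, 4]
    print(ntt_eager(poly), ntt_lazy(poly), ntt_reference(poly))
```

Both variants return the same transform. In this sketch the eager path issues N reductions per output element and the lazy path issues one, which is the kind of reduction pressure the abstract associates with strict multi-tenant separation. The lazy path is only valid while the unreduced accumulator stays below P * R, so a real deployment would have to size the accumulator and radix accordingly.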
