SMEPilot: Characterizing and Optimizing LLM Inference with Scalable Matrix Extensions

2026-06-15 • Distributed, Parallel, and Cluster Computing

Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Performance

AI summary

The authors examine how CPU matrix extensions such as Arm SME can accelerate some parts of LLM inference but are a poor fit for others. They characterize how the matrix units and conventional CPU cores perform across different operators and how the two compete for shared memory bandwidth. Based on this analysis, they build SMEPilot, an inference engine that chooses, per operator, whether to run on the CPU cores, the SME units, or both. Across phone, PC, and server platforms, SMEPilot improves end-to-end inference performance by up to 3.94×.

CPU, matrix extension, Arm Scalable Matrix Extension, LLM inference, operator-level execution, roofline model, attention mechanism, tensor layout, KV-cache, performance optimization

Authors

Feiyang Chen, Haibo Chen

Abstract

Modern CPUs increasingly integrate matrix extensions, such as Arm Scalable Matrix Extension (SME), that provide high-throughput matrix execution within the CPU. For LLM inference, however, these units are not a universal replacement for conventional CPU cores: prefill, decode, attention, and KV-cache operations expose different arithmetic intensities, vector behavior, and layout requirements, while SME units and CPU cores still compete for shared memory bandwidth. This paper studies this mismatch through a roofline-based characterization of SME-enabled CPUs and uses the resulting model to guide operator-level execution choices. We present SMEPilot, an LLM inference engine that selects CPU-only, SME-only, or cooperative SME+CPU execution for each operator shape. SMEPilot partitions matrix work across SME and CPU cores at tile granularity, overlaps SME-suitable matrix stages with CPU-suitable vector stages in attention, and maintains layout state so packed tensor representations are reused rather than repeatedly rebuilt on critical paths. Across Llama-3.2-3B, Qwen3-4B, and Qwen3-30B-A3B on phone, PC, and server platforms, SMEPilot improves end-to-end inference performance by up to 3.94×.
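
To make the roofline-guided choice described in the abstract concrete, the following is a minimal sketch and not the authors' model. It assumes a textbook roofline (attainable throughput is the minimum of the compute roof and arithmetic intensity times memory bandwidth), treats cooperative execution as adding the two compute roofs while sharing a single bandwidth limit, and splits tiles in proportion to peak throughput. The names `Unit`, `choose_mode`, and `split_tiles`, the tie-breaking rule, and every hardware number are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """Hypothetical compute unit; peak values are placeholders, not measurements."""
    name: str
    peak_gflops: float


def gemm_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], assuming each matrix moves once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable(peak_gflops: float, intensity: float, bw_gbs: float) -> float:
    """Textbook roofline: min(compute roof, intensity * memory bandwidth)."""
    return min(peak_gflops, intensity * bw_gbs)


def choose_mode(m: int, n: int, k: int, sme: Unit, cpu: Unit, bw_gbs: float) -> str:
    """Pick the mode with the highest modeled throughput for one GEMM shape."""
    ai = gemm_intensity(m, n, k)
    # Ordered from simplest to most complex. max() returns the first maximal
    # entry, so ties go to the simpler mode (an assumption of this sketch).
    modes = [
        ("cpu_only", attainable(cpu.peak_gflops, ai, bw_gbs)),
        ("sme_only", attainable(sme.peak_gflops, ai, bw_gbs)),
        # Cooperative: compute roofs add, but memory bandwidth is shared.
        ("sme+cpu", attainable(sme.peak_gflops + cpu.peak_gflops, ai, bw_gbs)),
    ]
    return max(modes, key=lambda mode: mode[1])[0]


def split_tiles(num_tiles: int, sme: Unit, cpu: Unit) -> tuple[int, int]:
    """Divide output tiles between SME and CPU in proportion to peak throughput."""
    total = sme.peak_gflops + cpu.peak_gflops
    sme_tiles = round(num_tiles * sme.peak_gflops / total)
    return sme_tiles, num_tiles - sme_tiles


if __name__ == "__main__":
    sme = Unit("sme", peak_gflops=2000.0)  # placeholder value
    cpu = Unit("cpu", peak_gflops=500.0)   # placeholder value
    bw = 100.0                             # placeholder GB/s, shared by both units
    for m in (1, 8, 512):                  # decode-like to prefill-like shapes
        print(m, choose_mode(m, 4096, 4096, sme, cpu, bw))
    print(split_tiles(100, sme, cpu))
```

Even under these simplifications, the sketch yields three regimes: memory-bound shapes gain nothing from the matrix unit, mid-intensity shapes saturate bandwidth on SME alone, and only compute-bound shapes benefit from using both. A real engine would need measured roofs and would also have to account for the vector behavior and layout costs the abstract mentions, which this model ignores.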
