FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

2026-06-22 | Cryptography and Security

Cryptography and Security, Machine Learning, Operating Systems
AI summary

The authors address the challenge of running large language models (LLMs) securely on mobile devices using ARM TrustZone, which normally slows down both the secure inference and regular apps. They propose FlexServe, a system that separates permission to access secure resources from permission to manage them, so the normal-world operating system can allocate and reclaim those resources without being able to read their contents. FlexServe introduces Recallable Secure Memory (Flex-Mem) and a Recallable Secure NPU (Flex-NPU), which only the secure world can access but the normal-world OS can still manage efficiently. In their tests, FlexServe achieves average time-to-first-token (TTFT) speedups of 10.05X over a TrustZone-based strawman design and 2.44X over an optimized strawman.

Large Language Models (LLMs), ARM TrustZone, Secure Inference, Resource Isolation, Secure Memory, Normal-world OS, Secure-world OS, Mobile Devices, Recallable Secure Memory, Neural Processing Unit (NPU)
Authors
Yinpeng Wu, Yitong Chen, Lixiang Wang, Jinyu Gu, Zhichao Hua, Yubin Xia
Abstract
Device-side Large Language Models (LLMs) have grown explosively, offering stronger privacy and higher availability than their cloud-side counterparts. During LLM inference, both the model weights and the user data are valuable, and attackers may compromise the OS kernel to steal them. ARM TrustZone is the de facto hardware-based isolation technology on mobile devices, used to protect sensitive applications from a compromised OS. However, protecting LLM inference with TrustZone incurs significant overhead on both the secure inference and normal applications, due to two challenges: inflexible resource isolation and inefficient secure resource management. To address these challenges, this paper presents FlexServe, a fast and secure LLM inference system for mobile devices. The key idea is to decouple the access permission from the management permission of secure resources, so that the normal-world OS cannot access them but can still manage them as usual. First, FlexServe introduces a Recallable Resource Isolation mechanism to construct Recallable Secure Memory (Flex-Mem) and a Recallable Secure NPU (Flex-NPU). Both can be accessed only by the secure world, but can be efficiently allocated and reclaimed by the normal-world OS. Building on them, FlexServe further introduces the FlexServe Framework to run secure LLM inference in the secure world. The framework works together with the normal-world OS to perform cooperative secure memory management. We implement a prototype of FlexServe and compare it with two TrustZone-based strawman designs. The results show that FlexServe achieves average time-to-first-token (TTFT) speedups of 10.05X over the strawman and 2.44X over an optimized strawman.
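
The decoupling of access permission from management permission can be illustrated with a toy model. The sketch below is not the paper's implementation: the names `World`, `RecallableMemory`, `lend_to_secure`, and `reclaim` are hypothetical, and scrubbing a page when its ownership changes is an assumption made here so the toy stays safe, not a detail stated in the abstract. It only shows the shape of the idea: the normal world may allocate and reclaim pages, while only the secure world may read or write a page that is currently secure.

```python
from enum import Enum


class World(Enum):
    NORMAL = "normal"
    SECURE = "secure"


class AccessDenied(Exception):
    """Raised when the normal world touches a page that is currently secure."""


class RecallableMemory:
    """Toy model: management and access are checked separately (hypothetical)."""

    def __init__(self, n_pages: int) -> None:
        self.secure = [False] * n_pages   # True while a page is lent to the secure world
        self.data = [b""] * n_pages

    # Management operations: in this toy they are what the normal-world OS calls.
    def lend_to_secure(self, idx: int) -> None:
        self.data[idx] = b""              # assumption: scrub before changing owner
        self.secure[idx] = True

    def reclaim(self, idx: int) -> None:
        self.data[idx] = b""              # assumption: scrub so no secret leaks back
        self.secure[idx] = False

    # Access operations: checked against the caller's world.
    def _check(self, caller: World, idx: int) -> None:
        if self.secure[idx] and caller is not World.SECURE:
            raise AccessDenied(f"page {idx} is secure")

    def write(self, caller: World, idx: int, payload: bytes) -> None:
        self._check(caller, idx)
        self.data[idx] = payload

    def read(self, caller: World, idx: int) -> bytes:
        self._check(caller, idx)
        return self.data[idx]


mem = RecallableMemory(n_pages=2)
mem.lend_to_secure(0)                         # normal world manages the page
mem.write(World.SECURE, 0, b"model weights")  # secure world uses it
try:
    mem.read(World.NORMAL, 0)                 # normal world cannot read it
except AccessDenied:
    pass
mem.reclaim(0)                                # normal world takes the page back
```

In this toy, reclaiming a page returns it to the normal world only after its contents are cleared, so management never implies access. Real hardware enforcement (and the NPU side of the design) is outside what this sketch models.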