FuseFSS: Efficient Secure LLM Inference with Function Secret Sharing

2026-06-08

Cryptography and Security, Artificial Intelligence
AI summary

The authors present FuseFSS, a compiler that speeds up two-server secure inference, in which a client queries a hosted language model without revealing its prompts or embeddings. Instead of designing a separate protocol for each fixed-point nonlinear operator, the compiler handles all of them through a single pipeline that batches the required comparisons and coefficient lookups into two function secret sharing (FSS) evaluations. Compared with the prior state-of-the-art FSS-based GPU system, this gives a 1.24× to 1.50× end-to-end speedup and 9% to 16% less online communication on BERT and GPT-style models, while preserving accuracy. Preprocessing is also lighter: key generation takes 14% to 23% less time and keys are 20% to 24% smaller.

two-server secure inference, large language models, function secret sharing, fixed-point nonlinearities, compiler optimization, GPU acceleration, BERT, GPT, secure computation, key generation
Authors
Yuhan Ma, Yong Li, Stefan Schmid
Abstract
Two-server secure inference allows a client to query a hosted large language model (LLM) without revealing prompts or embeddings. Recent GPU systems based on function secret sharing (FSS) make linear layers efficient, but fixed-point nonlinearities and helper operations remain a bottleneck because each operator is typically implemented as a bespoke protocol with its own comparisons, wrap-around corrections, and preprocessing material. We present FuseFSS, a compiler that replaces per-operator protocol design with a single compilation pipeline. For each scalar fixed-point operator, a compact specification lists its interval partition, low-degree arithmetic pieces, and required predicate bits. The compiler emits two batched FSS evaluations on the public masked value: one packed comparison that returns all predicate bits, and one vector interval lookup that returns the active coefficients and constants. Compared to the current state-of-the-art FSS-based GPU secure inference, FuseFSS preserves accuracy while achieving a $1.24\times$--$1.50\times$ end-to-end speedup and reducing online communication by $9\%$--$16\%$ on BERT and GPT-style models; preprocessing is also lighter, with $14\%$--$23\%$ lower key-generation time and $20\%$--$24\%$ smaller keys.
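To make the idea of a per-operator specification concrete, the following is a minimal plaintext sketch, not the FuseFSS implementation. It models a specification as an interval partition plus one low-degree polynomial per interval, and evaluates it with two steps that mirror the two batched evaluations named in the abstract: one comparison pass that returns all predicate bits, and one interval lookup that returns the active coefficients. Everything here is an assumption made for illustration: the names `OperatorSpec`, `packed_compare`, `interval_lookup`, and `evaluate`, the 12-bit fixed-point scale, and the choice of predicates as `x >= cutpoint`. No secret sharing, masking, or wrap-around correction is modeled.

```python
from bisect import bisect_right
from dataclasses import dataclass

FRAC_BITS = 12            # assumed fixed-point precision (illustrative only)
SCALE = 1 << FRAC_BITS


def to_fixed(x: float) -> int:
    """Encode a real number as a fixed-point integer."""
    return int(round(x * SCALE))


@dataclass(frozen=True)
class OperatorSpec:
    """Hypothetical specification of one scalar fixed-point operator."""
    name: str
    cutpoints: tuple  # sorted boundaries; k cutpoints define k + 1 intervals
    pieces: tuple     # per-interval coefficients, lowest degree first


def packed_compare(spec: OperatorSpec, x: int) -> list:
    """Plaintext stand-in for the packed comparison: all bits [x >= c]."""
    return [int(x >= c) for c in spec.cutpoints]


def interval_lookup(spec: OperatorSpec, x: int) -> tuple:
    """Plaintext stand-in for the interval lookup: the active coefficients."""
    return spec.pieces[bisect_right(spec.cutpoints, x)]


def evaluate(spec: OperatorSpec, x: int):
    """Evaluate the active polynomial piece with Horner's rule."""
    bits = packed_compare(spec, x)
    coeffs = interval_lookup(spec, x)
    acc = 0
    for c in reversed(coeffs):
        # Truncate after each fixed-point multiply to keep the scale.
        acc = ((acc * x) >> FRAC_BITS) + c
    return acc, bits


# ReLU: 0 below zero, x at or above zero.
RELU = OperatorSpec("relu", (0,), ((0, 0), (0, SCALE)))

# Clip to [-1, 1]: constant, identity, constant.
CLIP = OperatorSpec(
    "clip", (-SCALE, SCALE), ((-SCALE,), (0, SCALE), (SCALE,))
)

relu_pos = evaluate(RELU, to_fixed(1.5))
relu_neg = evaluate(RELU, to_fixed(-2.0))
clip_mid = evaluate(CLIP, to_fixed(0.5))
clip_high = evaluate(CLIP, to_fixed(3.0))
```

In this sketch the number of set predicate bits equals the index of the active interval, which is why a single comparison pass and a single lookup are enough to select and evaluate the right piece. The actual system performs both steps as FSS evaluations on a public masked value, which this plaintext version does not attempt to reproduce.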