Communication-Efficient Verifiable Attention for LLM Inference
2026-06-15 • Machine Learning
Machine Learning, Artificial Intelligence
AI summary
The authors study how to verify that large language model (LLM) inference is computed correctly when part of the work runs on untrusted hardware such as GPUs. Existing methods that compute and check work inside a Trusted Execution Environment (TEE) become slow and costly for LLMs. They propose VeriAttn, which offloads attention to the GPU and leaves verification to the TEE, speeding up inference while keeping the integrity checks. Evaluation shows VeriAttn is several times faster than the prior approach, both for long prompts and for long outputs.
Large Language Models (LLMs), Trusted Execution Environment (TEE), Transformer Attention, GPU Offloading, Computation Integrity, Key-Value Cache, Prefill and Decoding, Secure Computation, Intel TDX, Performance Optimization
Authors
Ziqun Chen, Ming Wu, Michael Heinrich, Jason Zeng, Huiying Lan, Tianwei Zhang, Rui Tan
Abstract
The computation integrity of remote large language model (LLM) serving can be questionable. For conventional deep neural networks (DNNs), the existing TEE-shielded DNN partitioning (TSDP) approach uses a Trusted Execution Environment (TEE) to compute non-linear components and to verify the integrity of linear components offloaded to an untrusted GPU. However, directly applying TSDP to Transformer-based LLMs incurs significant TEE computation and TEE-GPU communication overhead. This paper presents Communication-efficient TEE-GPU Attention (VeriAttn) for accelerating verifiable LLM inference. VeriAttn offloads both the linear and non-linear computations of attention to the GPU, while the TEE performs verification. For prefill, VeriAttn uses a two-level pipeline to overlap data movement, TEE pre-/post-processing, and GPU computation. For decoding, when the key-value cache exceeds available GPU memory, VeriAttn partitions attention across the TEE and the GPU to reduce repeated key-value transfers. Evaluation on an Intel TDX platform shows that VeriAttn achieves a 2.60-3.38× speedup over TSDP during prefill with 6k-token prompts and a 3.86-5.42× speedup during decoding with 10k-token outputs.
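The abstract does not say how the TEE verifies an offloaded linear component. As an illustration only, the sketch below uses a Freivalds-style randomized check, a well-known way to test a claimed matrix product with three matrix-vector products instead of a full recomputation. It is an assumption that TSDP or VeriAttn works this way. The function names (`matmul`, `matvec`, `freivalds_check`) are hypothetical, and exact integer arithmetic stands in for the floating-point or quantized values a real system would handle.

```python
import random


def matmul(a, b):
    """Plain matrix product; stands in for the untrusted GPU (hypothetical)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Matrix-vector product, the only operation the verifier needs."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def freivalds_check(a, b, c, rounds=4, rng=None):
    """Return True if the claimed product c passes every random probe.

    Each round draws a random vector r and compares A(Br) with Cr.
    A correct c always passes; an incorrect c passes a round only
    with small probability, so more rounds give more confidence.
    """
    rng = rng or random.Random()
    n_cols = len(b[0])
    for _ in range(rounds):
        r = [rng.randrange(1 << 30) for _ in range(n_cols)]
        if matvec(a, matvec(b, r)) != matvec(c, r):
            return False
    return True


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c = matmul(a, b)                     # result returned by the "GPU"
    print(freivalds_check(a, b, c))      # an honest result passes
    tampered = [row[:] for row in c]
    tampered[0][0] += 1                  # simulate a corrupted result
    print(freivalds_check(a, b, tampered))
```

The appeal of this kind of check is its cost: for n×n matrices the verifier does O(n²) work per round, against O(n³) for recomputing the product, which is why verification can be cheaper than running the linear layer inside the trusted environment.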