Communication-Efficient Verifiable Attention for LLM Inference

2026-06-15 • Machine Learning

Machine Learning, Artificial Intelligence

AI summary

The authors study how to ensure that large language model (LLM) inference is computed correctly when part of the work runs on an untrusted GPU. Existing approaches that compute and verify inside a Trusted Execution Environment (TEE) become slow and costly when applied to LLMs. They propose VeriAttn, which offloads the attention computation to the GPU and has the TEE verify the results. On an Intel TDX platform, VeriAttn is reported to be 2.60-3.38× faster than the prior TEE-based approach for prefill with 6k-token prompts and 3.86-5.42× faster for decoding with 10k-token outputs.

Large Language Models (LLMs), Trusted Execution Environment (TEE), Transformer Attention, GPU Offloading, Computation Integrity, Key-Value Cache, Prefill and Decoding, Secure Computation, Intel TDX, Performance Optimization

Authors

Ziqun Chen, Ming Wu, Michael Heinrich, Jason Zeng, Huiying Lan, Tianwei Zhang, Rui Tan

Abstract

The computation integrity of remote large language model (LLM) serving cannot be taken for granted. For conventional deep neural networks (DNNs), the existing TEE-shielded DNN partitioning (TSDP) approach uses a Trusted Execution Environment (TEE) to compute non-linear components and to verify the integrity of linear components offloaded to an untrusted GPU. However, directly applying TSDP to Transformer-based LLMs incurs significant TEE computation and TEE-GPU communication overhead. This paper presents Communication-efficient TEE-GPU Attention (VeriAttn) for accelerating verifiable LLM inference. VeriAttn offloads both the linear and the non-linear computations of attention to the GPU, while the TEE performs verification. For prefill, VeriAttn uses a two-level pipeline to overlap data movement, TEE pre-/post-processing, and GPU computation. For decoding, when the key-value cache exceeds the available GPU memory, VeriAttn partitions attention across the TEE and the GPU to reduce repeated key-value transfers. Evaluation on an Intel TDX platform shows that VeriAttn achieves 2.60-3.38× and 3.86-5.42× acceleration over TSDP during prefill with 6k-token prompts and decoding with 10k-token outputs, respectively.
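The abstract does not say how the TEE verifies results returned by the GPU. As an illustration only, a standard technique for checking an outsourced matrix product is Freivalds' algorithm: rather than recomputing A·B, the verifier multiplies both sides by random vectors, which costs O(n²) per round instead of O(n³). The sketch below is not VeriAttn's protocol. The function names are hypothetical, and integer matrices are assumed so that equality is exact (a floating-point version would need a tolerance).

```python
import random


def matmul(a, b):
    """Plain-Python matrix product, standing in for the untrusted GPU."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def matvec(m, v):
    """Matrix-vector product, the only operation the verifier needs."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def freivalds_check(a, b, c, rounds=16, rng=None):
    """Probabilistically check that c equals a @ b (hypothetical helper).

    A correct c always passes. An incorrect c passes a single round with
    probability at most 1/2, so it passes all rounds with probability
    at most 2 ** -rounds.
    """
    rng = rng or random.Random()
    n_cols = len(b[0])
    for _ in range(rounds):
        # Random 0/1 vector; compare a @ (b @ r) against c @ r.
        r = [rng.randint(0, 1) for _ in range(n_cols)]
        if matvec(a, matvec(b, r)) != matvec(c, r):
            return False
    return True


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul(a, b)                   # result claimed by the untrusted side
accepted = freivalds_check(a, b, c)
```

Here `accepted` is `True` because `c` is the honest product. If any entry of `c` were altered, the check would reject except with probability at most 2^-16 under the default 16 rounds.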
