GELATO: Generative Entropy- and Lyapunov-based Adaptive Token Offloading for Device-Edge Speculative LLM Inference

2026-05-11

Networking and Internet Architecture; Distributed, Parallel, and Cluster Computing; Information Theory; Machine Learning
AI summary

The authors address the challenge of running large language models (LLMs) on resource-constrained devices by proposing a system called GELATO. GELATO decides, token by token, whether drafting work should be done locally on the device or offloaded to a more powerful edge server, balancing energy use and speed. It combines entropy-based uncertainty estimates with Lyapunov optimization to predict which tokens need further verification and which can be accepted quickly. Their experiments show that GELATO increases token-generation throughput while using less energy than competing methods, without lowering output quality.
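
To make the per-token decision concrete, here is a minimal sketch of entropy-driven early exiting during drafting, based only on the description above; the names `draft_step`, `max_draft_len`, and `entropy_threshold` are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch (not the authors' implementation): stop drafting locally once
# the draft model becomes uncertain, so the uncertain token can be handled by
# the stronger target model instead.
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def draft_with_early_exit(draft_step, max_draft_len, entropy_threshold):
    """Draft tokens while the draft model is confident; exit early otherwise.

    draft_step: hypothetical callable running one draft-model step and
                returning (token, probs).
    """
    drafted = []
    for _ in range(max_draft_len):
        token, probs = draft_step()
        if token_entropy(probs) > entropy_threshold:
            break  # uncertain token: stop drafting and defer to verification
        drafted.append(token)
    return drafted
```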

Large Language Models, Speculative Decoding, Edge Computing, Resource Scheduling, Energy Efficiency, Token Offloading, Generative Entropy, Lyapunov Optimization, Throughput, Online Decision Making
Authors
Zengzipeng Tang, Yuxuan Sun, Wei Chen, Jianwen Ding, Bo Ai
Abstract
The recent growth of on-device Large Language Model (LLM) inference has driven significant interest in device-edge collaborative LLM inference. Speculative Decoding (SD), in which a lightweight draft model rapidly generates candidate tokens to be verified by a powerful target model, is an increasingly adopted and promising architecture. However, a fundamental challenge lies in achieving per-token resource scheduling that effectively adapts the SD paradigm to resource-constrained edge environments. This paper proposes a Generative Entropy- and Lyapunov-based Adaptive Token Offloading framework, named GELATO, to maximize decoding throughput under energy constraints in a device-edge collaborative SD system. Specifically, an outer drift-plus-penalty loop makes online decisions that establish a reference drafting budget, managing the long-term energy-throughput trade-off. A nested entropy-driven generation mechanism then performs early exiting to adapt to dynamic per-token generative uncertainty. Theoretical analysis establishes a rigorous bound on the long-term throughput of GELATO. Extensive evaluations demonstrate that GELATO achieves a globally optimal energy-throughput trade-off, outperforming state-of-the-art distributed SD architectures by 64.98% in token throughput and reducing energy consumption by 47.47% in resource-constrained environments, while preserving LLM decoding quality.
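
As a rough illustration of the outer drift-plus-penalty loop described in the abstract, the sketch below uses the standard Lyapunov virtual-queue form to choose a drafting budget under a long-term energy budget; the candidate budgets, the estimators, and the trade-off weight `V` are assumptions for exposition, not GELATO's actual implementation.

```python
# Minimal sketch, assuming the standard drift-plus-penalty construction:
# maximize V * throughput - Q * energy at each step, where Q is a virtual
# queue tracking accumulated energy-budget violation.

def choose_draft_budget(candidates, Q, V, est_throughput, est_energy):
    """Pick the drafting budget (e.g., number of draft tokens) that maximizes
    V * estimated_throughput - Q * estimated_energy for the current step."""
    return max(candidates, key=lambda k: V * est_throughput(k) - Q * est_energy(k))

def update_virtual_queue(Q, energy_used, energy_budget):
    """Update the virtual energy queue after observing the step's energy use."""
    return max(Q + energy_used - energy_budget, 0.0)
```

A larger `V` favors throughput over energy; as the virtual queue `Q` grows after energy overruns, the decision automatically shifts toward cheaper (smaller or more offloaded) drafting budgets, which is the usual mechanism by which drift-plus-penalty enforces a long-term constraint.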