EnerInfer: Energy-Aware On-Device LLM Inference

2026-06-22, Software Engineering

AI summary

The authors study running large language models (LLMs) directly on devices such as phones and laptops, focusing on saving energy and reducing heat rather than only on decoding speed. They find that modestly lowering NPU and DDR memory frequencies can preserve the quality of experience while using less power and generating less heat. To make this practical, they build EnerInfer, a system that predicts throughput and power for different models and frequency settings and chooses a configuration at runtime, without per-model profiling or component-level power sensors. In their evaluation, EnerInfer improves energy efficiency by up to 65% on phones, 12% on a laptop, and 24% on a development board without violating quality-of-experience targets.

on-device inference, large language models, energy efficiency, thermal management, NPU frequency, DDR memory frequency, quality of experience, throughput, runtime optimization, online feedback control
Authors
Bohua Zou, Nian Liu, Binqi Sun, Matteo Mascherin, Debayan Roy, Yutao Liu, Yu Peng, Ning Jia, Haibo Chen
Abstract
On-device LLM inference is increasingly attractive for privacy-preserving, reliable, and cost-effective deployment, yet its energy and thermal costs remain a critical bottleneck. Existing systems primarily optimize for decoding speed, implicitly assuming that faster execution is always preferable. We show instead that on-device LLM inference often has exploitable configuration slack: modestly lowering NPU and memory frequencies preserves quality of experience (QoE) while substantially improving energy efficiency and reducing heat. Realizing this opportunity in production is challenging. The most energy-efficient NPU/DDR setting varies with the model, inference engine, platform, and runtime conditions, with no stable ranking across configurations. Commercial devices further lack component-level power sensing, and shell temperature evolves with request arrivals, response lengths, and thermal history. To address these challenges, we propose EnerInfer, the first on-device LLM inference framework that jointly manages energy efficiency, throughput, and thermal comfort for LLM workloads. EnerInfer replaces per-model profiling and sensor-heavy control with disaggregated, model-structure-aware prediction and ranking-driven online feedback. It predicts throughput and power for unseen LLMs across NPU/DDR frequency settings, selects QoE-satisfying efficient configurations under runtime interference, and uses lightweight limited-horizon thermal prediction to dynamically switch between energy-optimized and thermally constrained inference. Evaluations on real-world LLMs show that EnerInfer improves energy efficiency by up to 65%, 12%, and 24% on phones, a laptop, and a development board, respectively, without QoE violation.
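The selection step described in the abstract (choosing an efficient NPU/DDR setting that still satisfies QoE) can be illustrated with a minimal sketch. This is not EnerInfer's algorithm. It assumes that QoE is expressed as a minimum decoding throughput, that energy efficiency is measured in tokens per joule, and that predicted throughput and power are already available for each candidate setting. The names `FreqConfig` and `pick_config` and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class FreqConfig:
    """One candidate NPU/DDR frequency pair with predicted metrics (hypothetical)."""
    npu_mhz: int
    ddr_mhz: int
    tokens_per_s: float  # predicted decoding throughput
    power_w: float       # predicted power draw in watts

    @property
    def tokens_per_joule(self) -> float:
        # Assumed efficiency metric: (tokens/s) / (J/s) = tokens/J.
        return self.tokens_per_s / self.power_w


def pick_config(configs: Iterable[FreqConfig],
                min_tokens_per_s: float) -> Optional[FreqConfig]:
    """Return the most energy-efficient config meeting the throughput floor.

    Returns None when no candidate satisfies the assumed QoE threshold.
    """
    feasible = [c for c in configs if c.tokens_per_s >= min_tokens_per_s]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.tokens_per_joule)


# Illustrative, made-up predictions for three frequency settings.
candidates = [
    FreqConfig(npu_mhz=1000, ddr_mhz=3200, tokens_per_s=20.0, power_w=5.0),
    FreqConfig(npu_mhz=800, ddr_mhz=2400, tokens_per_s=16.0, power_w=3.2),
    FreqConfig(npu_mhz=600, ddr_mhz=1600, tokens_per_s=9.0, power_w=1.5),
]

# With a floor of 12 tokens/s, the lowest setting is excluded as too slow,
# and the mid setting wins on tokens per joule over the fastest one.
chosen = pick_config(candidates, min_tokens_per_s=12.0)
print(chosen)
```

In this toy data the fastest setting is not the most efficient feasible one, which is the "configuration slack" the abstract refers to. A real controller would also need to re-rank configurations under runtime interference and thermal limits, which this sketch omits.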