The Energy Consumption of Transformer Fine-Tuning: A Roofline-Inspired Scaling Model

2026-06-22 • Machine Learning

Machine Learning • Artificial Intelligence • Hardware Architecture • Computation and Language • Distributed, Parallel, and Cluster Computing

AI summary

The authors developed a way to predict how much energy it takes to train Transformer models, which are widely used in natural language processing. They trained BERT models of different sizes and configurations on multiple GPUs and related the measured energy to simple estimates of computing work, memory traffic, and hardware efficiency. Their method includes a speedup-based factor that captures how two ways of splitting up the work, tensor parallelism and fully sharded data parallelism, affect hardware efficiency. The resulting scaling law predicts training energy across different configurations, which supports cost-aware and more sustainable system design.

Transformer models • BERT • Energy consumption • GPU parallelism • Compute proxy • Memory traffic • Hardware efficiency • Scaling laws • Tensor parallelism • Data parallelism

Authors

Mansour Zoubeirou a Mayaki

Abstract

Transformer-based models underpin modern natural language processing but incur rapidly growing computational and energy costs. As training scales in both model size and parallelism, accurately predicting energy consumption has become critical for sustainable and cost-aware system design. We present a framework for modeling the energy consumption of Transformer training on multiple GPUs. Using controlled architectural sweeps of BERT models, we relate measured energy to lightweight proxies for compute, memory traffic, and hardware efficiency. Inspired by roofline models, our approach incorporates a speedup-based hardware-efficiency factor that captures the effects of tensor parallelism and fully sharded data parallelism. We derive a scaling law model that accurately predicts training energy across heterogeneous configurations.
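
The abstract does not state the fitted model itself, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes a textbook roofline bound (run time limited by either compute or memory traffic) and the standard definition of parallel efficiency as speedup divided by GPU count. All names (`TrainingConfig`, `estimate_energy_joules`, `peak_flops`, `peak_bandwidth`, `avg_power_watts`) and all numbers in the usage lines are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    """Hypothetical description of one training run (illustrative only)."""
    flops: float        # compute proxy: total floating-point operations
    bytes_moved: float  # memory-traffic proxy: total bytes transferred
    n_gpus: int         # number of GPUs used
    speedup: float      # measured speedup relative to a single-GPU run


def parallel_efficiency(cfg: TrainingConfig) -> float:
    """Standard parallel efficiency: speedup per GPU (assumed, not from the paper)."""
    return cfg.speedup / cfg.n_gpus


def estimate_energy_joules(cfg: TrainingConfig,
                           peak_flops: float,
                           peak_bandwidth: float,
                           avg_power_watts: float) -> float:
    """Roofline-style energy estimate under the assumptions stated above."""
    # Roofline bound: the single-GPU run time is set by the slower of
    # the compute-limited and memory-limited times.
    compute_time = cfg.flops / peak_flops
    memory_time = cfg.bytes_moved / peak_bandwidth
    single_gpu_time = max(compute_time, memory_time)

    # Parallelism shortens wall-clock time by the measured speedup,
    # but every GPU draws power for the whole run.
    wall_time = single_gpu_time / cfg.speedup
    return cfg.n_gpus * avg_power_watts * wall_time


# Made-up numbers, chosen only to exercise the functions.
cfg = TrainingConfig(flops=1e15, bytes_moved=1e12, n_gpus=4, speedup=3.2)
energy = estimate_energy_joules(cfg, peak_flops=1e14,
                                peak_bandwidth=1e12, avg_power_watts=300.0)
print(f"efficiency={parallel_efficiency(cfg):.2f}, energy={energy:.0f} J")
```

In this sketch the estimate reduces to single-GPU energy divided by parallel efficiency, so a configuration that scales imperfectly (efficiency below 1) costs proportionally more energy for the same work. A fitted scaling law like the one the abstract describes would replace these fixed hardware constants with coefficients learned from measured energy.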
