LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws
2026-05-22 • Machine Learning
Machine Learning, Artificial Intelligence, Information Theory
AI summary
The authors explain that previous formulas predicting how bigger and longer-trained large language models perform don't capture situations where performance gets worse despite more training. They create a new theory, called the Shannon Scaling Law, which treats model training like sending information through a noisy channel, using ideas from communication theory. This helps explain why too much training or increasing model size without enough quality data can actually hurt performance. They test their theory on different models and tasks, showing it predicts real outcomes better than older methods.
Large Language Models, Scaling Laws, Shannon-Hartley Theorem, Signal-to-Noise Ratio, Information Theory, Catastrophic Overtraining, Quantization, Model Performance, Pythia Models, Loss Surface
Authors
Xu Ouyang, Deyi Liu, Yuhang Cai, Jing Liu, Yuan Yang, Chen Zheng, Thomas Hartvigsen, Yiyuan Ma
Abstract
Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute. We propose the Shannon Scaling Law, a unified theoretical framework that models LLM training as information transmission over a noisy channel, grounded in the Shannon-Hartley theorem. By mapping model parameters to channel bandwidth and training tokens to signal power, our formulation explicitly captures the interaction between learning signal and intrinsic noise. This perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation. We validate our theory through experiments on Pythia and OLMo2 under perturbations, including Gaussian noise, quantization and supervised fine-tuning on math, QA and code tasks. The Shannon Scaling Law consistently outperforms classical scaling laws and recent perturbation-aware laws, achieving strong $R^2$ scores and accurately capturing loss basins missed by prior approaches. It also extrapolates: fitted on $\leq$6.9B Pythia models with $\leq$180B tokens, it predicts the unseen 12B model up to 307B tokens at pooled $R^2{=}0.847$, while monotonic baselines collapse.
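For reference, the classical Shannon-Hartley theorem that the abstract builds on gives channel capacity as $C = B \log_2(1 + S/N)$, with bandwidth $B$ and signal-to-noise ratio $S/N$. The sketch below evaluates only this textbook formula. It is not the paper's fitted Shannon Scaling Law, whose functional form the abstract does not state; the function name and the numeric values are illustrative assumptions. It shows in the simplest terms why widening the channel helps when the SNR is preserved and can hurt when it is not.

```python
import math


def shannon_hartley_capacity(bandwidth: float, snr_linear: float) -> float:
    """Classical Shannon-Hartley capacity C = B * log2(1 + S/N).

    `bandwidth` and `snr_linear` are generic channel quantities here;
    relating them to parameters and tokens is the paper's analogy and
    is not modeled by this function.
    """
    if bandwidth < 0 or snr_linear < 0:
        raise ValueError("bandwidth and SNR must be non-negative")
    return bandwidth * math.log2(1.0 + snr_linear)


# Illustrative values only (assumed, not taken from the paper).
base = shannon_hartley_capacity(1.0, 15.0)      # 1 * log2(16)
wider = shannon_hartley_capacity(2.0, 15.0)     # bandwidth doubled, SNR preserved
degraded = shannon_hartley_capacity(2.0, 1.0)   # bandwidth doubled, SNR collapsed

print(base, wider, degraded)  # prints: 4.0 8.0 2.0
```

With the SNR held at 15, doubling the bandwidth doubles the capacity; when the SNR falls to 1, the wider channel carries less than the original narrow one.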