Open Problem: Is AdamW Effective Under Heavy-Tailed Noise?
2026-06-22 • Machine Learning
Machine Learning, Artificial Intelligence
AI summary
The authors study AdamW, a widely used optimizer for training large language models, in the setting where stochastic gradient noise is heavy-tailed, so its variance need not be finite. Sign-based optimizers such as Lion and Muon, as well as AdaGrad, already have convergence results in this setting, but AdamW has no rigorous convergence theory there. The authors ask whether AdamW converges under the same assumptions, prove a positive weighted-metric benchmark, and describe a lower-bound mechanism in which the memory of the second-moment accumulator can hide large gradients. They pose the question as an open problem.
AdamW, heavy-tailed noise, stochastic gradient descent, optimizer convergence, second-moment accumulator, sign-based optimizers, Lion optimizer, AdaGrad, gradient noise, large language models
Authors
Dingzhi Yu, Hongyi Tao, Yuanyu Wan, Luo Luo, Lijun Zhang
Abstract
AdamW is the de facto optimizer for training large language models (LLMs), yet the theory behind it still lives mostly in finite-variance regimes. This is increasingly unsatisfying, as empirical evidence indicates that stochastic gradient noise in LLM pretraining is typically heavy-tailed. Recent work shows that sign-based optimizers such as Lion and Muon achieve sharp heavy-tailed rates, and that AdaGrad can also converge under heavy-tailed noise. However, no rigorous convergence theory for AdamW has yet been established in this regime. Can AdamW converge under the same heavy-tailed assumptions, or does its second-moment accumulator create a genuine obstruction? We formulate this as an open problem, prove a positive weighted-metric benchmark, and give a corridor lower-bound mechanism showing how denominator memory can hide large gradients.
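To make "denominator memory" concrete, the sketch below implements the textbook AdamW update and feeds it a hand-picked scalar gradient sequence with one large spike. It is only a toy illustration, not the corridor construction from the paper: the function names, the spike size, and the gradient sequence are assumptions chosen for the example, and the hyperparameters are common defaults rather than values taken from the abstract.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One textbook AdamW step with bias correction and decoupled weight decay."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment (momentum) accumulator
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second-moment accumulator
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)                # bias-corrected second moment
    direction = m_hat / (np.sqrt(v_hat) + eps)    # normalized update direction
    theta = theta - lr * (direction + weight_decay * theta)
    return theta, m, v, direction

def spike_demo(n_steps=150, spike_step=51, spike_size=1000.0):
    """Hypothetical gradient stream: constant 1.0 with a single large spike."""
    theta, m, v = 1.0, 0.0, 0.0
    directions = []
    for t in range(1, n_steps + 1):
        grad = spike_size if t == spike_step else 1.0
        theta, m, v, d = adamw_step(theta, grad, m, v, t)
        directions.append(float(d))
    return directions

directions = spike_demo()
print(f"normalized update at step 50 (before spike): {directions[49]:.3f}")
print(f"normalized update at step 150 (long after spike): {directions[149]:.3f}")
```

In this toy run the normalized update is close to 1 before the spike and stays far below that many steps afterwards. With `beta1 = 0.9` the numerator forgets the spike within a few dozen steps, while with `beta2 = 0.999` the denominator remembers it for on the order of a thousand steps. That asymmetry is the kind of accumulator memory the abstract refers to, although the paper's actual mechanism may differ in its details.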