EMA-Nesterov: Stabilizing Nesterov's Lookahead for Accelerated Deep Learning Optimization
2026-05-25 • Machine Learning
AI summary
The authors explain that traditional acceleration methods like Nesterov's momentum can be unreliable in deep learning because they rely on noisy short-term updates. Instead, they suggest looking at longer-term trends in the optimization path to guide updates more smoothly. To do this, they propose EMA-Nesterov, which uses an exponential moving average of past updates to create a steadier direction for parameter changes. Their method keeps strong theoretical performance and shows better results across different optimizers during language model training compared to older lookahead approaches.
Nesterov's momentum, lookahead optimization, exponential moving average (EMA), stochastic gradient noise, non-convex loss, convex optimization, gradient descent, language model pre-training, Adam optimizer
Authors
Chung-Yiu Yau, Dawei Li, Athanasios Glentis, Valentyn Boreiko, Hoi-To Wai, Mingyi Hong
Abstract
Lookahead-based acceleration methods, such as Nesterov's momentum, are widely used in optimization, but they often become unreliable in deep learning training, mainly due to stochastic gradient noise and non-convex loss landscapes. In particular, standard lookahead relies on short-horizon update signals (e.g., differences between consecutive iterates), which are inherently noisy and can lead to unstable extrapolation directions. This work revisits Nesterov's acceleration from a trajectory perspective and argues that effective acceleration in deep learning should harness the low-frequency trends of optimization trajectories rather than extrapolating noisy one-step updates. Leveraging this insight, we propose EMA-Nesterov, a simple modification that replaces the standard Nesterov lookahead direction with an exponential moving average (EMA) of parameter updates. This yields a stabilized lookahead direction that captures and harnesses the evolving trend of the training trajectory through a low-pass filter, while remaining adaptive to progressive changes via the geometric weighting structure of EMA. We show that EMA-Nesterov retains a theoretical accelerated convergence rate in convex problems that is analogous to that of Nesterov's accelerated gradient method. Furthermore, we provide empirical evidence on language model pre-training to verify that EMA-Nesterov is broadly applicable across a range of fine-tuned base optimizers, including Adam, SOAP, and Muon, as well as complex optimizers that achieve state-of-the-art performance on optimization benchmarks (NanoGPT). Compared to prior lookahead methods, EMA-Nesterov achieves better performance by avoiding the instability of short-horizon lookahead and the non-adaptivity of long-horizon lookahead.
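To make the core idea concrete, the following is a minimal toy sketch in Python with NumPy, not the authors' algorithm. The abstract states only that the lookahead direction is an EMA of parameter updates. The exact update rule, the plain gradient-descent base step, and the names `ema_lookahead_gd`, `beta`, and `gamma` are assumptions made here for illustration. With `beta = 0` the smoothed direction reduces to the last one-step update, so the sketch falls back to a standard Nesterov-style extrapolation.

```python
import numpy as np


def ema_lookahead_gd(grad, x0, lr=0.05, beta=0.9, gamma=0.9, steps=2000):
    """Toy gradient descent with an EMA-smoothed lookahead direction.

    Illustrative assumption only: the update rule below is a guess at the
    general shape of the idea, not the rule from the paper.
    """
    x = np.asarray(x0, dtype=float).copy()
    ema = np.zeros_like(x)  # EMA of past parameter updates (low-pass filter)
    for _ in range(steps):
        # Extrapolate along the smoothed trend rather than the last raw update.
        y = x + gamma * ema
        # Base-optimizer step (plain gradient descent here) from the lookahead point.
        x_new = y - lr * grad(y)
        # Geometric weighting: recent updates count more, older ones decay.
        ema = beta * ema + (1.0 - beta) * (x_new - x)
        x = x_new
    return x


# Usage on an ill-conditioned quadratic f(x) = 0.5 * sum(a_i * x_i^2).
a = np.array([1.0, 10.0])
x_final = ema_lookahead_gd(lambda x: a * x, x0=[5.0, -3.0])
print(np.linalg.norm(x_final))
```

In this sketch `beta` sets how long a horizon the direction averages over, and `gamma` scales how far the iterate is extrapolated along it. Both are hypothetical knobs; the paper's parameterization may differ.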