Scaling the Memory of Balanced Adam

2026-05-11 · Machine Learning

AI summary

The authors study a simplified version of the Adam optimizer, called balanced Adam, in which the two momentum parameters are tied into one. They argue that this single parameter, beta, should be understood as controlling how long the optimizer "remembers" past gradient statistics rather than as a fixed constant. By linking beta to a "refresh count," which measures how many times the optimizer renews its internal statistics during the useful phase of training, they improve the optimizer's robustness across different tasks. Their rule reduces the worst-case validation gap by 33.4% compared with the best fixed beta value, suggesting that tuning beta jointly with memory scale and training duration is the better strategy.

Adam optimizer, momentum parameters, balanced Adam, beta parameter, statistical memory horizon, refresh count, learning horizon, validation trajectory, optimizer robustness, hyperparameter tuning
Authors
Alberto Fernández-Hernández, Cristian Pérez-Corral, Jose I. Mestre, Manuel F. Dolz, Enrique S. Quintana-Ortí
Abstract
Recent evidence suggests that Adam performs robustly when its momentum parameters are tied, $β_1=β_2$, reducing the optimizer to a single remaining parameter. However, the value of this parameter is still poorly understood. We argue that, in balanced Adam, $β$ should not be treated as a dimensionless constant: it defines a statistical memory horizon $H_β=(1-β)^{-1}$. In terms of the effective learning horizon $T_{\mathrm{ES}}$, estimated from the validation trajectory, we study the refresh count $R_β=(1-β)T_{\mathrm{ES}}$, which measures how many times Adam renews its internal statistics during the useful phase of training. Across 11 vision and language experiments, we find that choosing $β$ so that $R_β\approx1000$ selects different $β$ values depending on the training scale, yet improves robustness over the best fixed-$β$ baseline. Compared with the strongest fixed choice $β=0.94377$, the refresh rule improves worst-case robustness, reducing the global maximum validation gap by $33.4\%$, while bringing all 11 runs within $1\%$ of their validation oracle. These results suggest that the remaining hyperparameter of balanced Adam is better understood as a memory-scale variable than as a fixed constant. This provides a simple budget-aware perspective on optimizer scaling and opens a path toward treating Adam's momentum as part of the learning dynamics rather than as a static default.
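The refresh rule described in the abstract can be inverted directly: given a target refresh count $R_β$ and an estimated learning horizon $T_{\mathrm{ES}}$, solving $R_β=(1-β)T_{\mathrm{ES}}$ for $β$ yields $β = 1 - R_β/T_{\mathrm{ES}}$. The sketch below illustrates that arithmetic only; the function names and the input horizon are illustrative, not from the paper.

```python
def refresh_rule_beta(t_es: float, refresh_count: float = 1000.0) -> float:
    """Choose beta for balanced Adam (beta1 = beta2 = beta) so that the
    refresh count R_beta = (1 - beta) * T_ES matches the target.

    Inverting R_beta = (1 - beta) * T_ES gives beta = 1 - R_beta / T_ES.
    """
    if t_es <= refresh_count:
        raise ValueError("learning horizon must exceed the target refresh count")
    return 1.0 - refresh_count / t_es


def memory_horizon(beta: float) -> float:
    """Statistical memory horizon H_beta = (1 - beta)^{-1}: roughly the
    number of recent steps that dominate Adam's running statistics."""
    return 1.0 / (1.0 - beta)


if __name__ == "__main__":
    # Hypothetical run: early stopping suggests ~100,000 useful steps.
    t_es = 100_000
    beta = refresh_rule_beta(t_es)          # 1 - 1000/100000 = 0.99
    print(f"beta = {beta:.4f}, H_beta = {memory_horizon(beta):.1f} steps")
    # Sanity check: the implied refresh count recovers the target.
    print(f"R_beta = {(1.0 - beta) * t_es:.1f}")
```

Note how the rule makes $β$ scale-dependent: a short fine-tuning run of 5,000 steps would get $β=0.8$, while a 1M-step pretraining run would get $β=0.999$, each renewing Adam's statistics about 1,000 times over the useful phase of training.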