On the Width Scaling of Neural Optimizers Under Matrix Operator Norms I: Row/Column Normalization and Hyperparameter Transfer

2026-03-10

Machine Learning
AI summary

The authors study how to design neural network optimizers that work reliably as the network's width grows larger. They view popular optimizers like AdamW and Muon as following steepest descent under certain matrix norms, linking optimizer behavior to the network's smoothness properties independently of width. To handle deep networks better, they introduce new "mean-normalized" norms that allow for layerwise analysis and width-independent smoothness guarantees, leading to practical optimizers such as a rescaled AdamW and new row/column normalized methods. Their proposed optimizer, MOGA, uses these ideas to maintain stable learning rates across widths and performs well on large language models like GPT-2 and LLaMA, being faster than Muon in some cases. Overall, the authors offer a principled way to transfer learning rates in wide networks while controlling optimizer stability.

neural network optimizer, AdamW, Muon optimizer, steepest descent, matrix operator norm, Lipschitz constant, smoothness, learning rate scaling, layerwise composability, width-independent bounds
Authors
Ruihan Xu, Jiajin Li, Yiping Lu
Abstract
A central question in modern deep learning is how to design optimizers whose behavior remains stable as the network width $w$ increases. We address this question by interpreting several widely used neural-network optimizers, including \textrm{AdamW} and \textrm{Muon}, as instances of steepest descent under matrix operator norms. This perspective links optimizer geometry with the Lipschitz structure of the network forward map, and enables width-independent control of both Lipschitz and smoothness constants. However, steepest-descent rules induced by standard $p \to q$ operator norms lack layerwise composability and therefore cannot provide width-independent bounds in deep architectures. We overcome this limitation by introducing a family of mean-normalized operator norms, denoted $\pmean \to \qmean$, that admit layerwise composability, yield width-independent smoothness bounds, and give rise to practical optimizers such as \emph{rescaled} \textrm{AdamW}, row normalization, and column normalization. The resulting width-aware learning-rate scaling rules recover $\mu$P scaling~\cite{yang2021tensor} as a special case and provide a principled mechanism for cross-width learning-rate transfer across a broad class of optimizers. We further show that \textrm{Muon} can suffer an $\mathcal{O}(\sqrt{w})$ worst-case growth in the smoothness constant, whereas a new family of row-normalized optimizers we propose achieves width-independent smoothness guarantees. Based on these observations, we propose MOGA (Matrix Operator Geometry Aware), a width-aware optimizer based only on row/column-wise normalization that enables stable learning-rate transfer across model widths. Large-scale pre-training on GPT-2 and LLaMA shows that MOGA, especially with row normalization, is competitive with Muon while being notably faster in large-token and low-loss regimes.
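To make the row-normalization idea concrete, here is a minimal NumPy sketch of a row-normalized steepest-descent step of the kind the abstract describes. The function name, the $\ell_2$ choice of per-row norm, and the `eps` stabilizer are illustrative assumptions, not the paper's exact MOGA update; the point is only that scaling each row of the gradient to unit norm makes the per-row step size independent of the width (number of columns), which is the mechanism behind width-independent learning-rate transfer.

```python
import numpy as np

def row_normalized_step(W, G, lr, eps=1e-8):
    """Hypothetical row-normalized update (illustrative, not the paper's exact rule).

    Each row of the gradient G is rescaled to unit l2 norm before the
    descent step, so every row of W moves by exactly lr in l2 distance,
    regardless of how wide the layer is.
    """
    row_norms = np.linalg.norm(G, axis=1, keepdims=True)  # shape (rows, 1)
    return W - lr * G / (row_norms + eps)

# Example: a 4 x 8 weight matrix; every row of the update has l2 norm ~= lr.
W = np.zeros((4, 8))
G = np.ones((4, 8))
W_new = row_normalized_step(W, G, lr=0.1)
```

A column-normalized variant would simply take norms along `axis=0`; under this kind of normalization, the learning rate no longer needs to shrink as $1/w$ the way raw-gradient methods require under $\mu$P-style scaling.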