Muon learns balanced solutions in matrix factorization without slow saddle-to-saddle dynamics

2026-06-29 • Machine Learning

AI summary

The authors study how parameter trajectories under the Muon optimizer differ from those of gradient descent in matrix factorization. They find that Muon learns all the top modes of the target matrix at the same rate instead of passing slowly from saddle to saddle, and that it remains stable at learning rates above the threshold set by the local loss sharpness, which permits rapid convergence under exponential learning rate annealing. Once the weights are aligned, Muon flow conserves a different matrix quantity than gradient flow, yet both optimizers reach the balanced solution from vanishing initialization. The authors also derive alignment rates in simple settings that predict the empirical alignment rates more generally, and they design a learning rate schedule that achieves near-perfect alignment in only two optimization steps.

Matrix factorization, Gradient descent, Muon optimizer, Learning rate, Saddle points, Loss sharpness, Parameter alignment, Balanced solutions, Conserved quantities, Optimization dynamics

Authors

Mark Rhee, Jamie Simon, Dhruva Karkada

Abstract

Matrix factorization (i.e., problems of the form $\min_{\mathbf{P},\mathbf{Q}} \|\mathbf{M}^\star - \mathbf{P}^\top\mathbf{Q}\|_\mathrm{F}^2$) is a minimal learning problem that exhibits both nonlinear parameter dynamics and representation learning. In this setting, we study how parameter trajectories under the Muon optimizer differ from those of gradient descent. We identify three main dynamical differences: 1) Muon avoids the slow saddle-to-saddle dynamics from small initialization. Muon instead learns all the top modes of $\mathbf{M}^\star$ at the same rate, with the smaller modes converging first. 2) Muon remains stable even when the learning rate exceeds the critical threshold set by the local loss sharpness. This frees the learning rate from the condition number of the problem, enabling rapid convergence via exponential learning rate annealing. 3) Once the weights are aligned with each other and the target, Muon flow conserves the matrix quantity $\sqrt{\mathbf{P}^\top \mathbf{P}}-\sqrt{\mathbf{Q}^\top \mathbf{Q}}$, while gradient flow is known to conserve the matrix $\mathbf{P}^\top\mathbf{P} - \mathbf{Q}^\top\mathbf{Q}$. Despite having distinct conserved quantities, both optimizers find the so-called \textit{balanced} solution from vanishing initialization. When training from small random initialization, the weights spontaneously align early in training. We derive the alignment rates in simple settings and show that they predict the empirical alignment rates in general. Finally, we exploit structural properties of Muon to construct a learning rate schedule that achieves near-perfect alignment in only two optimization steps.
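The two update rules compared in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' code, and it rests on several assumptions: the factors are stored as $k \times m$ and $k \times n$ arrays so that $\mathbf{P}^\top\mathbf{Q}$ matches the shape of $\mathbf{M}^\star$; the Muon-style step omits momentum and uses an exact SVD polar factor in place of the approximate orthogonalization used in practical implementations; and the gradient-descent imbalance is formed on the shared $k \times k$ side as `P @ P.T - Q @ Q.T`. All function names are hypothetical.

```python
import numpy as np

def grads(M, P, Q):
    """Gradients of ||M - P^T Q||_F^2 with respect to P (k x m) and Q (k x n)."""
    R = M - P.T @ Q                      # residual, shape (m, n)
    return -2.0 * Q @ R.T, -2.0 * P @ R

def orthogonalize(G):
    """Polar factor U V^T of G via exact SVD (assumption: a stand-in for the
    approximate orthogonalization used in practical Muon implementations)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def gd_step(M, P, Q, lr):
    """One plain gradient-descent step on both factors."""
    gP, gQ = grads(M, P, Q)
    return P - lr * gP, Q - lr * gQ

def muon_step(M, P, Q, lr):
    """One momentum-free Muon-style step (assumption: momentum omitted).
    Each factor moves along its orthogonalized gradient, so the step size
    does not depend on the gradient's singular values."""
    gP, gQ = grads(M, P, Q)
    return P - lr * orthogonalize(gP), Q - lr * orthogonalize(gQ)

def imbalance(P, Q):
    """Gradient-flow invariant in this shape convention (k x k matrix)."""
    return P @ P.T - Q @ Q.T

# Small random problem; the sizes and scales are illustrative only.
rng = np.random.default_rng(0)
k, m, n = 3, 5, 4
M = rng.normal(size=(m, n))
P = 0.5 * rng.normal(size=(k, m))
Q = 0.5 * rng.normal(size=(k, n))
lr = 1e-2

gP, gQ = grads(M, P, Q)

# Discrete gradient descent changes the imbalance only at second order in lr,
# which is why it is exactly conserved in the gradient-flow limit.
P_gd, Q_gd = gd_step(M, P, Q, lr)
gd_drift = imbalance(P_gd, Q_gd) - imbalance(P, Q)

# The Muon-style update direction has all singular values equal to one.
O = orthogonalize(gP)
P_mu, Q_mu = muon_step(M, P, Q, lr)
```

In this sketch, `gd_drift` equals `lr**2 * (gP @ gP.T - gQ @ gQ.T)` because the first-order terms cancel, and rescaling the gradient leaves `orthogonalize(gP)` unchanged. The second property is one way to see why the Muon step size is decoupled from the scale of the gradient, and hence from the spectrum of the target.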
