Beyond Muon: MUD (MomentUm Decorrelation) for Faster Transformer Training
2026-03-18 • Machine Learning
AI summary
The authors introduce MUD, a new technique to improve the speed of training transformer models by making optimizer updates simpler and faster. MUD replaces an existing complex step from Muon (which uses polar decomposition) with a faster method inspired by classic numerical algorithms such as Gram–Schmidt and Gauss–Seidel. Although MUD may take slightly more steps to converge, it reduces total training time significantly by cutting down the per-step computational overhead. Their results show that MUD speeds up training on various models, including GPT-2 and a protein language model, while maintaining similar accuracy.
Orthogonalized-momentum optimizers · Polar decomposition · Cholesky decomposition · Gram–Schmidt process · Gauss–Seidel method · Transformer training · Matrix whitening · Perplexity · AdamW optimizer · Protein language model
Authors
Ben S. Southworth, Stephen Thomas
Abstract
Orthogonalized-momentum optimizers such as Muon improve transformer training by approximately whitening/orthogonalizing matrix-valued momentum updates via a short polar-decomposition iteration. However, polar-factor approximations typically require multiple large matrix multiplications, and the resulting overhead can be substantial and hardware-dependent. We introduce MUD (MomentUm Decorrelation), a complementary whitening approach that replaces Muon's polar update with a triangular (Cholesky-like) whitening surrogate inspired by classical Gram--Schmidt and Gauss--Seidel ideas. We show that row-orthonormal matrices are fixed points of the MUD map, relate the inner step to symmetric Gauss--Seidel preconditioning of the Gram matrix, and prove quadratic local convergence near the fixed point. MUD yields consistent 10--50\% wall-clock improvements in time-to-perplexity over tuned AdamW and Muon, typically converging slightly slower per step than Muon but with substantially lower optimizer overhead: relative to Muon, MUD improves peak tokens/s by roughly $1.3$--$2.6\times$ across most settings and up to nearly $3\times$ on GPT-2 large on an A100. We also demonstrate training an ESM-2 150M protein language model, where MUD matches Muon-level validation perplexity in significantly less wall-clock time.
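The abstract does not spell out the MUD inner iteration, but the core idea of triangular (Cholesky-based) whitening of a momentum matrix can be sketched. The following is a minimal illustration, not the paper's algorithm: given a momentum matrix $M$ with more columns than rows, whitening its rows via the Cholesky factor of the Gram matrix $G = MM^\top$ produces a row-orthonormal update, and row-orthonormal matrices are fixed points of this map (if $G = I$ then $L = I$ and $M$ is returned unchanged). The function name and the `eps` regularizer are illustrative choices, not from the source.

```python
import numpy as np

def cholesky_whiten(M, eps=1e-6):
    """Whiten the rows of M via a Cholesky factor of its Gram matrix.

    Illustrative sketch only: MUD's actual inner step (a Gauss-Seidel-style
    surrogate) is not reproduced here; this shows the exact triangular
    whitening it approximates.
    """
    d = M.shape[0]
    G = M @ M.T                                   # Gram matrix of the rows
    L = np.linalg.cholesky(G + eps * np.eye(d))   # lower-triangular factor, G = L L^T
    # Solve L W = M, so W = L^{-1} M and W W^T = L^{-1} G L^{-T} = I (for eps = 0).
    return np.linalg.solve(L, M)

# Example: whiten a random 4 x 16 momentum-like matrix.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 16))
W = cholesky_whiten(M, eps=0.0)
# Rows of W are now orthonormal: W @ W.T is the 4 x 4 identity.
```

Unlike the polar factor (which gives the closest orthonormal matrix), the Cholesky route costs one small triangular factorization and one triangular solve, which is consistent with the abstract's claim of lower optimizer overhead per step.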