Muown Implicitly Performs Angular Step-size Decay

2026-06-22

Machine Learning
AI summary

The authors study Muown, an optimizer for Transformer weight matrices that updates row magnitudes and directions separately. They show that its directional update corresponds to a step along the sphere of normalized directions, with the magnitude of the un-normalized parameters only setting the angular step size, which explains why the method is stable with respect to the step size. Building on this insight, they propose AngularMuown, which controls the angular step size explicitly and allows it to be scheduled. Experiments on modded nanoGPT, Qwen2-0.5B, and 1.1B-parameter mixture-of-experts models indicate that AngularMuown improves over Muown and scales beyond small models.

Transformer, weight matrix, optimizer, Muown, Adam optimizer, Riemannian optimization, step size, directional update, mixture-of-experts, angular step
Authors
Florian Hübler, Kai Lion, Antonio Orvieto, Niao He
Abstract
Matrix-aware optimizers such as Muon and Muown have recently shown strong empirical performance for pre-training Transformers. In particular, Muown separates each weight matrix into row magnitudes and an un-normalized direction variable, updating the former with Adam and the latter with Muon. We show that the directional update of Muown is equivalent to a Riemannian step on the normalized directions, while the magnitude of the un-normalized parameterization only modulates the angular step size. This explains the step-size stability of Muown and suggests making the angular step size explicit. The resulting method, AngularMuown, optimizes directly over the normalized directions and uses a schedulable angular multiplier decoupled from the radial magnitude update. AngularMuown improves over Muown and, at the time of writing, a preliminary version is leading the per-optimizer category of the modded nanoGPT speedrunning competition. Further experiments on Qwen2-0.5B and 1.1B-parameter mixture-of-experts models confirm that the algorithm scales beyond small models. An implementation of the algorithm is available at https://github.com/fhueb/angular-muown.
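
The geometric claim in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it considers a single row only, replaces Muon's update with a plain unit-norm step in the tangent direction (an assumption made purely for illustration), and omits momentum and the Adam magnitude update. All function names are hypothetical. Under these assumptions, a step of size lr on the un-normalized direction v rotates the normalized direction by the angle arctan(lr / ||v||), and since ||v|| grows with every step, the angle shrinks over time. The same rotation can be written as an explicit step on the sphere with a freely chosen angle theta.

```python
import numpy as np

def unit_tangent(u, g):
    """Project g onto the tangent space of the unit sphere at u, then normalize."""
    t = g - np.dot(g, u) * u
    return t / np.linalg.norm(t)

def step_unnormalized(v, g, lr):
    """Fixed-length step on the un-normalized direction v (illustrative stand-in).

    For w = v / ||v||, the gradient with respect to v is orthogonal to v,
    so the update direction lies in the tangent space at u = v / ||v||.
    """
    u = v / np.linalg.norm(v)
    return v - lr * unit_tangent(u, g)

def step_angular(u, g, theta):
    """Explicit step on the sphere: rotate u by angle theta towards -tangent(g)."""
    d = -unit_tangent(u, g)
    return np.cos(theta) * u + np.sin(theta) * d

def simulate(steps=50, dim=8, lr=0.5, seed=0):
    """Track the rotation angle per step and ||v|| before each step."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)  # start with ||v|| = 1
    angles, norms = [], []
    for _ in range(steps):
        g = rng.standard_normal(dim)  # random stand-in for a loss gradient
        u = v / np.linalg.norm(v)
        v_next = step_unnormalized(v, g, lr)
        u_next = v_next / np.linalg.norm(v_next)
        cos_angle = np.clip(np.dot(u, u_next), -1.0, 1.0)
        angles.append(np.arccos(cos_angle))
        norms.append(np.linalg.norm(v))
        v = v_next
    return np.array(angles), np.array(norms)

angles, norms = simulate()
print("first angle:", angles[0], "last angle:", angles[-1])
```

In this toy setting the angle decays only because ||v|| grows. Passing a scheduled theta to step_angular instead decouples the angular step from the norm, which is the spirit of the explicit angular multiplier described in the abstract.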