Muon Learns More Robust and Transferable Features than Adam
2026-06-08 • Machine Learning
Machine Learning, Artificial Intelligence
AI summary
The authors studied a new optimizer called Muon to see how well it helps computers learn features from data compared to popular methods like Adam and SGD. They found that models trained with Muon are better at handling noisy or corrupted images and texts, showing more robustness. Also, Muon-trained features work better when reused for new tasks, meaning they transfer more effectively. The authors support these findings both with experiments on different neural network types and with a theoretical explanation involving concepts like decision margins and feature diversity.
Keywords: optimizer, Muon, Adam, SGD, feature learning, robustness, transferability, logit margin, effective rank, large language models
Authors
Tianyu Ruan, Fengzhuo Zhang, Shuche Wang, Shihua Zhang
Abstract
Muon has recently emerged as a state-of-the-art optimizer for pretraining Large Language Models (LLMs) and vision classifiers. Despite its efficiency advantage over Adam and SGD, the feature-learning advantage of Muon remains unclear. This paper investigates Muon's feature-learning advantage through the lens of robustness and transferability. First, by evaluating pretrained models on corrupted images and texts, we show that features learned by Muon are consistently more robust than those learned by Adam and SGD across different architectures, including transformers and Convolutional Neural Networks (CNNs). Using trained layer-wise probes, we further show that this robustness advantage is reflected in larger logit margins across layers. Second, by training linear classifiers or fine-tuning full models from pretrained parameters on downstream tasks, we demonstrate that Muon-learned features transfer more effectively than those learned by Adam and SGD. This transferability advantage is further supported by the diversity of hidden states across layers, as measured by effective rank. Finally, in a representative classification problem with multi-component features, we prove that Muon attains larger margins and higher effective rank than Adam and SGD, providing theoretical support for our empirical findings.
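The abstract relies on two diagnostics, logit margin and effective rank, without defining them. The sketch below shows one common way to compute each with NumPy. It assumes the margin is the true-class logit minus the largest competing logit, and that effective rank is the exponential of the Shannon entropy of the normalized singular values. The paper's exact definitions may differ. The function names `logit_margin` and `effective_rank` are illustrative and are not taken from the authors' code.

```python
import numpy as np


def logit_margin(logits, labels):
    """Per-example margin: true-class logit minus the largest other logit.

    Assumed definition; a positive value means the example is classified
    correctly, and a larger value means more room before a perturbation
    flips the prediction.
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    rows = np.arange(logits.shape[0])
    true_logit = logits[rows, labels]
    # Mask out the true class so the max runs over competing classes only.
    others = logits.copy()
    others[rows, labels] = -np.inf
    return true_logit - others.max(axis=1)


def effective_rank(hidden_states, eps=1e-12):
    """Effective rank of a (num_samples, hidden_dim) matrix.

    Assumed definition: exp of the Shannon entropy of the singular values
    after normalizing them to sum to one.
    """
    s = np.linalg.svd(np.asarray(hidden_states, dtype=float), compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically zero directions to avoid log(0)
    return float(np.exp(-(p * np.log(p)).sum()))


if __name__ == "__main__":
    logits = [[2.0, 0.5, -1.0], [0.0, 1.0, 3.0]]
    labels = [0, 1]
    print(logit_margin(logits, labels))   # one margin per example
    print(effective_rank(np.eye(4)))      # all directions equally used
    print(effective_rank(np.ones((5, 3))))  # a single dominant direction
```

Under these assumed definitions, a matrix whose singular values are all equal has effective rank equal to its dimension, while a rank-one matrix has effective rank close to one. This is the sense in which a higher value indicates more diverse hidden states.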