CacheMuon: Using Temporal Preconditioning To Approximate Polar Factor

2026-06-15
Machine Learning

AI summary

The authors study Muon, an optimizer that computes each update from the polar factor of the momentum matrix. Muon approximates this factor with a Newton-Schulz iteration at every optimization step, which is costly. Because the momentum matrix changes smoothly over training, the authors propose CacheMuon, which reuses information from previous steps to approximate the current polar factor. They report that conservative thresholds closely match Muon while reducing orthogonalization FLOPs, and that more aggressive thresholds save more computation at the cost of modest validation-quality degradation.

optimizer, polar factor, momentum matrix, Newton-Schulz iteration, singular value decomposition, orthogonalization, temporal correlation, preconditioning, language modeling, vision training
Authors
Bishnu Dev, Sushil Bohara, Martin Takáč, Samuel Horváth
Abstract
Muon is an optimizer that computes updates using the polar factor of the momentum matrix and has shown strong empirical performance across a range of training settings. A key component of Muon is the Newton-Schulz iteration used to compute this polar factor. Although this avoids the cost of an exact singular value decomposition, it remains expensive in practice because it is applied at every optimization step. At the same time, the momentum matrix changes smoothly over training, suggesting strong temporal correlation in the corresponding polar factors. In this paper, we exploit this structure and propose CacheMuon, a temporal preconditioning method that reuses information from previous optimization steps to approximate the polar factor at the current step. This reduces redundant orthogonalization computation across iterations. We analyze CacheMuon as an inexact Muon update, with error controlled by fresh-solver error and cache staleness. Empirically, CacheMuon provides a controllable quality-efficiency frontier: conservative thresholds closely match fresh Muon on language-model and vision training while reducing orthogonalization FLOPs, whereas more aggressive thresholds yield larger arithmetic savings at the cost of modest validation-quality degradation.
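
To make the quantities in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the classical cubic Newton-Schulz iteration (Muon itself may use a differently tuned polynomial) and a purely hypothetical reuse rule: the cached polar factor is returned whenever the momentum matrix has drifted by less than a relative threshold since the last fresh solve. The names `newton_schulz_polar`, `CachedPolar`, `threshold`, and `fresh_solves` are illustrative assumptions, and the abstract does not specify what CacheMuon's thresholds actually measure.

```python
import numpy as np


def newton_schulz_polar(M, steps=30):
    """Approximate the polar factor U V^T of M (where M = U S V^T).

    Uses the classical cubic Newton-Schulz iteration, which converges
    for full-rank M once its singular values are scaled into (0, sqrt(3)).
    """
    # The Frobenius norm upper-bounds the spectral norm, so after this
    # scaling every singular value lies in (0, 1].
    X = M / (np.linalg.norm(M) + 1e-12)
    for _ in range(steps):
        # Each step maps a singular value s to 1.5*s - 0.5*s**3,
        # which pushes s toward 1.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


class CachedPolar:
    """Hypothetical staleness-threshold cache (an assumption for
    illustration, not the rule used by CacheMuon)."""

    def __init__(self, threshold, steps=30):
        self.threshold = threshold  # allowed relative drift before a fresh solve
        self.steps = steps
        self.ref = None             # momentum matrix at the last fresh solve
        self.polar = None           # polar factor computed for self.ref
        self.fresh_solves = 0       # number of full Newton-Schulz solves so far

    def __call__(self, M):
        if self.ref is not None:
            drift = np.linalg.norm(M - self.ref) / (np.linalg.norm(self.ref) + 1e-12)
            if drift <= self.threshold:
                # Momentum barely moved: reuse the stale polar factor.
                return self.polar
        # Otherwise pay for a fresh solve and refresh the cache.
        self.ref = M.copy()
        self.polar = newton_schulz_polar(M, self.steps)
        self.fresh_solves += 1
        return self.polar
```

In this sketch, a smaller `threshold` triggers fresh solves more often and tracks the exact polar factor more closely, while a larger one skips more orthogonalization work and returns a staler result. This mirrors, in a simplified form, the quality-efficiency trade-off the abstract describes.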