MuonSSM: Orthogonalizing State Space Models for Sequence Modeling

2026-06-29 · Machine Learning

AI summary

The authors present MuonSSM, a framework for improving state space models (SSMs), which process very long sequences efficiently. Rather than altering the recurrent transition matrix, MuonSSM stabilizes the memory updates themselves by adding a momentum-based pathway and applying a Newton-Schulz transformation to the low-rank input injections. This keeps the updates bounded and well conditioned, so the models retain information over long sequences without becoming unstable. Experiments on language, vision, and time-series benchmarks show that integrating MuonSSM into several SSM backbones consistently improves accuracy and robustness.

state space models, sequence modeling, memory updates, Newton-Schulz transformation, momentum, gradient propagation, spectral conditioning, long-context learning, stability, parallel scan complexity
Authors
Thai-Khanh Nguyen, Ngoc-Bich-Uyen Vo, Thieu N. Vo, Tan M. Nguyen, Cuong Pham
Abstract
State space models (SSMs) have emerged as efficient linear-time alternatives to attention for long-sequence modeling. However, existing SSMs often suffer from instability and memory degradation over extended horizons due to poorly conditioned first-order updates and unbalanced update geometry. We introduce MuonSSM, a general framework that stabilizes SSM training by explicitly conditioning the geometry of memory updates rather than the recurrent transition matrix. MuonSSM augments SSMs with a momentum-based pathway and a lightweight Newton-Schulz transformation on low-rank input injections, yielding bounded and spectrally conditioned updates while preserving parallel scan complexity. Theory shows that MuonSSM improves gradient propagation, mitigates spectral amplification, and enriches memory representations over long horizons. Extensive experiments across language, vision, and time-series benchmarks show consistent gains in accuracy, robustness, and long-context performance when integrated into diverse SSM backbones. These results establish geometric conditioning of updates as a principled pathway to stable, scalable sequence modeling.
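
To make "bounded and spectrally conditioned updates" concrete, the following is a minimal NumPy sketch of the classical cubic Newton-Schulz iteration applied to a small low-rank matrix. It is an illustration under stated assumptions, not the authors' implementation: the function name `newton_schulz_orthogonalize`, the step count, the toy `keys` and `values` arrays, and the choice of the cubic polynomial are all hypothetical, and the abstract does not specify which Newton-Schulz variant MuonSSM uses or where exactly it is applied.

```python
import numpy as np


def newton_schulz_orthogonalize(update, steps=30, eps=1e-12):
    """Approximate the orthogonal polar factor of `update`.

    Classical cubic Newton-Schulz iteration: X <- 1.5 X - 0.5 X X^T X.
    Each nonzero singular value is driven toward 1; zero ones stay near 0.
    (Hypothetical helper; the paper's exact variant may differ.)
    """
    x = np.asarray(update, dtype=np.float64)
    # Frobenius normalization puts every singular value in [0, 1],
    # which lies inside the iteration's convergence region (0, sqrt(3)).
    x = x / (np.linalg.norm(x) + eps)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x


# Toy rank-2 "input injection" built from two outer products (assumed shapes).
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])                 # (3, 2)
values = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])   # (4, 2)
raw_update = keys @ values.T                                          # (3, 4), rank 2

conditioned = newton_schulz_orthogonalize(raw_update)

# Singular values before and after conditioning, largest first.
raw_singular_values = np.linalg.svd(raw_update, compute_uv=False)
singular_values = np.linalg.svd(conditioned, compute_uv=False)
print("raw:        ", raw_singular_values)
print("conditioned:", singular_values)
```

In this sketch the two nonzero singular values of the raw update are uneven, while after the iteration they are both close to 1 and the rank-deficient direction stays close to 0, so the spectral norm of the conditioned update is bounded by roughly 1. The generous default of 30 steps is chosen only so the toy example converges tightly; a lightweight variant would likely use far fewer steps, possibly with a tuned polynomial.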