Generalization at the Edge of Stability

2026-04-21 · Machine Learning

Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition
AI summary

The authors study why training neural networks with large learning rates, where the optimization process is oscillatory and chaotic, often improves generalization, a benefit that remains poorly understood. They model training as a random dynamical system that settles onto a complex fractal set rather than a fixed point. Building on this view, they define a new complexity measure called the 'sharpness dimension' and show that it predicts how well a network will generalize. Their results reveal that fine-grained information about the network's curvature matters, not just the simple summary measures used in prior work. They also validate the theory on several model families and connect it to the phenomenon of grokking.

neural networks, learning rate, stochastic optimization, random dynamical systems, fractal attractor, sharpness dimension, generalization, Hessian matrix, Lyapunov dimension, grokking
Authors
Mario Tuci, Caner Korkmaz, Umut Şimşekli, Tolga Birdal
Abstract
Training modern neural networks often relies on large learning rates, operating at the edge of stability, where the optimization dynamics exhibit oscillatory and chaotic behavior. Empirically, this regime often yields improved generalization performance, yet the underlying mechanism remains poorly understood. In this work, we represent stochastic optimizers as random dynamical systems, which often converge to a fractal attractor set (rather than a point) with a smaller intrinsic dimension. Building on this connection and inspired by Lyapunov dimension theory, we introduce a novel notion of dimension, coined the 'sharpness dimension', and prove a generalization bound based on this dimension. Our results show that generalization in the chaotic regime depends on the complete Hessian spectrum and the structure of its partial determinants, highlighting a complexity that cannot be captured by the trace or spectral norm considered in prior work. Experiments across various MLPs and transformers validate our theory while also providing new insights into the recently observed phenomenon of grokking.
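The sharpness dimension itself is defined in the paper; as background, the classical Lyapunov (Kaplan–Yorke) dimension that inspires it can be sketched from a spectrum of Lyapunov exponents. The sketch below is illustrative only, assuming the standard Kaplan–Yorke formula, and is not the paper's sharpness dimension:

```python
import numpy as np

def kaplan_yorke_dimension(exponents):
    """Classical Kaplan-Yorke (Lyapunov) dimension of an attractor.

    Sorts the Lyapunov exponents in descending order, finds the largest j
    whose partial sum is still non-negative, and interpolates linearly
    into the (j+1)-th contracting direction.
    """
    lam = np.sort(np.asarray(exponents, dtype=float))[::-1]  # descending
    partial = np.cumsum(lam)  # partial sums of the leading exponents
    if partial[0] < 0:
        return 0.0  # contraction in every direction: attractor is a point
    j = int(np.max(np.nonzero(partial >= 0)[0])) + 1
    if j == len(lam):
        return float(len(lam))  # no contracting direction left to interpolate
    # fractional part: how far the expanding "volume" reaches into direction j+1
    return j + partial[j - 1] / abs(lam[j])

# Example: the often-cited Lorenz-system spectrum (approximate values)
print(kaplan_yorke_dimension([0.906, 0.0, -14.572]))  # ~2.06, a fractal value
```

The fractional result illustrates the abstract's point that an attractor set can have a non-integer intrinsic dimension strictly smaller than the ambient parameter dimension.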