Radial Suppression Accelerates Algorithmic Generalization: A Geometric Analysis of Delayed Generalization

2026-06-30 · Machine Learning

Machine Learning, Artificial Intelligence
AI summary

The authors explore why neural networks memorize algorithmic training data long before they generalize. In a geometric case study, they argue that on tasks where generalization requires discovering structured low-dimensional circuits, this delay is driven by the network's hidden representations expanding outward (radial inflation) under cross-entropy training. They separate this radial change from angular changes in the network's activations and derive three testable propositions about how penalizing the expansion affects learning: it induces anisotropic, data-dependent weight regularization, it forces predominantly angular updates, and it biases convergence toward flatter minima. In their experiments, a single-hyperparameter norm penalty that limits radial growth brings generalization forward: grokking on modular arithmetic arrives up to 6x sooner across MLPs and Transformers, and a 10M-parameter nanoGPT needs half as many training steps on 3-digit addition.

neural networks, generalization, memorization, cross-entropy optimization, activation space, radial-angular decomposition, weight regularization, grokking, modular arithmetic, Transformers
Authors
Srijan Tiwari, Aditya Chauhan, Manjot Singh
Abstract
Why do neural networks memorize algorithmic training data long before they generalize? We present a geometric case study demonstrating that, on tasks where generalization requires discovering structured low-dimensional circuits, the memorization-generalization delay is driven by radial inflation of hidden representations under cross-entropy optimization. We formalize a radial-angular decomposition of activation-space dynamics and derive three testable propositions: (i) that penalizing radial inflation induces anisotropic, data-dependent weight regularization; (ii) that it suppresses radial gradient energy below the isotropic random baseline, forcing predominantly angular updates; and (iii) that it biases convergence toward flatter minima. To empirically validate these propositions, we study a single-hyperparameter norm penalty that softly constrains activations to a sqrt(d)-radius hypersphere. On modular arithmetic, this penalty accelerates grokking up to 6x across MLPs and Transformers, and halves training steps for a 10M-parameter nanoGPT on 3-digit addition.
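
To make the two geometric ingredients of the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the penalty is the mean squared deviation of each activation norm from sqrt(d), which is one plausible reading of "softly constrains activations to a sqrt(d)-radius hypersphere". The function names and the coefficient `lam` are hypothetical and chosen only for illustration.

```python
import math


def radial_penalty(batch, lam=0.1):
    """Mean squared deviation of activation norms from sqrt(d), scaled by lam.

    ASSUMPTION: the quadratic form and the name `lam` are illustrative; the
    abstract only states that a single-hyperparameter penalty softly
    constrains activations to a sqrt(d)-radius hypersphere.
    """
    d = len(batch[0])
    target = math.sqrt(d)
    norms = [math.sqrt(sum(x * x for x in h)) for h in batch]
    return lam * sum((n - target) ** 2 for n in norms) / len(batch)


def radial_angular_split(h, g):
    """Split a gradient g (taken w.r.t. activation h) into two parts.

    The radial part is the projection of g onto h / ||h||; the angular part
    is the remainder, which is orthogonal to h.
    """
    norm = math.sqrt(sum(x * x for x in h))
    unit = [x / norm for x in h]
    radial_coeff = sum(gi * ui for gi, ui in zip(g, unit))
    radial = [radial_coeff * ui for ui in unit]
    angular = [gi - ri for gi, ri in zip(g, radial)]
    return radial, angular


def radial_energy_fraction(h, g):
    """Fraction of the squared gradient norm that lies along the radial direction."""
    radial, _ = radial_angular_split(h, g)
    total = sum(gi * gi for gi in g)
    return sum(r * r for r in radial) / total


if __name__ == "__main__":
    # An activation already on the sqrt(d) sphere incurs no penalty.
    print(radial_penalty([[1.0, 1.0, 1.0, 1.0]]))
    # Share of gradient energy pointing along the activation itself.
    print(radial_energy_fraction([3.0, 4.0], [1.0, 0.0]))
```

For a gradient drawn from an isotropic distribution in d dimensions, the expected radial energy fraction is 1/d. Taking that as the "isotropic random baseline" is an assumption here, but under it proposition (ii) would correspond to measured fractions below 1/d once the penalty is active.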