Scaling Adaptive Depth with Norm-Agnostic Residual Networks

2026-06-15 • Machine Learning

Machine Learning, Artificial Intelligence

AI summary

The authors observe that as residual networks get deeper, the norm of the residual stream grows, so updates from later layers become small relative to the accumulated state. They propose NAG, a norm-agnostic residual architecture that separates magnitude from direction in the residual stream so that later layers are not systematically suppressed. This lets very deep models train effectively with only a negligible number of extra parameters. The same formulation yields a Mixture-of-Depths mechanism that adaptively skips attention and MLP layers, reducing forward-pass FLOPs; under equal training compute, skip rates of roughly 20%-25% match full-depth baseline performance.

residual architectures, deep learning, Transformer, residual stream, norm growth, Mixture-of-Depths, layer skipping, training efficiency, FLOPs, sparsity

Authors

Tomás Figliolia, Beren Millidge

Abstract

Residual architectures are ubiquitous in deep learning, but they suffer from a subtle structural limitation: the norm of the residual stream can grow rapidly with depth. As a result, updates from later layers become small relative to the accumulated residual state. This reduces their impact on the representation and limits the benefits of scaling models in depth. To address this, we introduce NAG, a norm-agnostic residual architecture that separates magnitude from directional information in the residual stream, preserving meaningful layer contributions throughout depth and preventing later updates from being systematically suppressed by residual-norm growth. Importantly, NAG introduces only a negligible number of additional parameters and relies on simple operations that are easily kernel-fusible, preserving training efficiency in practice. We show that this architecture outperforms baseline Transformers, with gains that increase substantially as depth grows, enabling effective training of much deeper models. The norm-agnostic formulation also leads to an interpretable Mixture-of-Depths (MoD) mechanism that adaptively skips both attention and MLP layers. Beyond serving as a post-training accuracy-compute tradeoff, this mechanism can be used as a pretraining-time scaling strategy: under iso-FLOP training, compute saved by reducing per-token forward-pass cost can be reinvested into training on more tokens while keeping the total parameter count and KV-cache budget fixed. In our experiments, moderate Mixture-of-Depths rates of approximately 20%-25% match full-depth baseline performance under equal training compute while substantially reducing the number of executed layer parameters and forward-pass FLOPs. These results identify sparsity in depth as a new scaling axis for fixed-compute training, enabling very deep yet FLOP-efficient models.
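
The norm-growth effect described in the abstract can be illustrated with a toy simulation. This is only a sketch under strong assumptions and is not the NAG architecture: each layer's output is replaced by a random unit-norm vector (so successive updates are roughly orthogonal), and the function name `relative_update_sizes` and its parameters are hypothetical. Under these assumptions the residual norm grows roughly like the square root of the number of layers, so the ratio of update norm to residual norm shrinks with depth.

```python
import math
import random


def relative_update_sizes(depth=64, width=256, seed=0):
    """Return ||update|| / ||residual|| at each layer of a toy residual stack."""
    rng = random.Random(seed)

    def unit_vector():
        # Random direction, normalized to norm 1 (assumption: layer outputs
        # behave like independent random directions of fixed size).
        v = [rng.gauss(0.0, 1.0) for _ in range(width)]
        n = math.sqrt(sum(a * a for a in v))
        return [a / n for a in v]

    x = unit_vector()  # initial residual state with norm 1
    ratios = []
    for _ in range(depth):
        u = unit_vector()  # stand-in for one layer's update, norm 1
        x_norm = math.sqrt(sum(a * a for a in x))
        ratios.append(1.0 / x_norm)  # update norm is 1 by construction
        x = [a + b for a, b in zip(x, u)]  # plain residual connection
    return ratios


ratios = relative_update_sizes()
print(f"layer 1: {ratios[0]:.3f}, layer {len(ratios)}: {ratios[-1]:.3f}")
```

In this toy setting the first update is as large as the residual state, while the last of 64 updates is only a small fraction of it. Real networks differ (updates are learned and correlated, and normalization layers intervene), so the numbers are illustrative only.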
