The Quantization Benefits of Residual-Free Transformers

2026-05-25 • Machine Learning


AI summary

The authors studied why reducing the number of bits used to store transformer activations (quantization) often damages performance. They found that residual connections, a standard architectural feature, can push activations toward heavy-tailed, non-Gaussian distributions during training, which increases quantization error. By comparing models with and without residual connections, they showed that removing these connections, combined with orthogonal initialization, spectral or second-order optimization, and depth-aware scaling of attention temperature, keeps activations close to Gaussian and makes the model more robust at low bit widths, at the cost of a small drop in full-precision performance. This suggests a trade-off between designing a model for accuracy and designing it for efficient compression.

Transformer, Residual Connections, Quantization, Activations, Gaussianity, Kurtosis, Orthogonal Initialization, Spectral Optimization, Attention Temperature, Low-bit Quantization

Authors

Yiping Ji, Mahalakshmi Sabanayagam, Peyman Moghadam, Hemanth Saratchandran, Simon Lucey

Abstract

Large-scale transformer training and deployment are increasingly constrained by the transfer of activations, gradients, and optimizer states across accelerators. Low-bit quantization offers a natural remedy, but transformer activations are often heavy-tailed and outlier-dominated, making simple quantization highly lossy. We show that this difficulty is not only a property of the quantizer, but also of the architecture. Specifically, residual connections can drive transformer activations away from Gaussianity during training. Using controlled comparisons between residual and residual-free transformers, we demonstrate that this effect leads to substantially higher quantization error and accuracy degradation at low precision in residual models. We explain the phenomenon through an excess kurtosis analysis, showing that residual mixing can amplify non-Gaussianity, whereas dense mixing in residual-free models contracts non-Gaussianity. We then show that residual-free transformers can be made trainable using orthogonal initialization, spectral or second-order optimization, and depth-aware scaling of attention temperature. In language tasks, while there is a small drop in full-precision performance, these models retain near-Gaussian activations and exhibit significantly improved robustness to low-bit quantization. Our results identify an accuracy-compressibility trade-off in transformer design and motivate architecture-level approaches to quantization-friendly foundation models.
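
The link the abstract draws between excess kurtosis and quantization error can be illustrated with a small NumPy sketch. It is not the authors' experimental setup: the Laplace samples are only a synthetic stand-in for heavy-tailed activations, and the helper names (excess_kurtosis, quantize_absmax, relative_error) and the 4-bit absmax quantizer are assumptions chosen for illustration.

```python
import numpy as np


def excess_kurtosis(x):
    """Sample excess kurtosis: fourth standardized moment minus 3 (0 for a Gaussian)."""
    x = np.asarray(x, dtype=np.float64)
    centered = x - x.mean()
    var = np.mean(centered ** 2)
    return float(np.mean(centered ** 4) / var ** 2 - 3.0)


def quantize_absmax(x, bits):
    """Symmetric uniform quantization with one scale set by the largest |x|.

    This is a deliberately simple quantizer (an assumption of this sketch),
    so a single outlier stretches the grid for every other value.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale


def relative_error(x, bits):
    """Squared quantization error normalized by the signal energy."""
    x_hat = quantize_absmax(x, bits)
    return float(np.sum((x - x_hat) ** 2) / np.sum(x ** 2))


rng = np.random.default_rng(0)
n = 100_000

# Synthetic stand-ins: near-Gaussian vs. heavy-tailed "activations".
gaussian = rng.standard_normal(n)
heavy = rng.laplace(loc=0.0, scale=1.0, size=n)

for name, sample in [("gaussian", gaussian), ("laplace", heavy)]:
    print(
        f"{name:9s} excess kurtosis = {excess_kurtosis(sample):6.3f}  "
        f"4-bit relative error = {relative_error(sample, 4):.4f}"
    )
```

Under these assumptions the heavy-tailed sample shows both a larger excess kurtosis and a larger relative error at 4 bits, because the absmax scale is set by rare outliers and the bulk of the values lands on a coarse grid. Real activations, per-channel scales, or clipping-based quantizers would change the numbers, but not the direction of the effect the abstract describes.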
