The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks
2026-03-05 • Artificial Intelligence • Computation and Language
AI summary
The authors study two special behaviors in Transformer language models called massive activations and attention sinks. Massive activations happen when a few tokens cause very strong signals that stay the same across many layers, acting like hidden settings inside the model. Attention sinks are tokens that grab more focus than usual but only affect nearby parts of the input. The authors found that these two effects often appear together because of how modern Transformers are built, especially due to a design choice called the pre-norm configuration. When this design is changed, the two behaviors stop happening at the same time and work separately.
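The design choice named above can be made concrete. The sketch below contrasts a pre-norm residual block, where the residual stream bypasses normalization and an extreme value can pass through untouched, with a post-norm block, where the residual sum is renormalized at every layer. This is a minimal numpy illustration of the two configurations, not the authors' experimental setup; the toy sublayer `f` and the magnitude thresholds are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_norm_block(x, f):
    """Pre-norm: x + f(norm(x)). The residual stream skips normalization,
    so a massive activation in x can persist unchanged across layers."""
    return x + f(layer_norm(x))

def post_norm_block(x, f):
    """Post-norm: norm(x + f(x)). The residual sum is renormalized,
    so extreme values are rescaled away at every layer."""
    return layer_norm(x + f(x))

# Toy sublayer and a hidden state with one extreme outlier channel.
f = lambda h: 0.1 * h
x = np.zeros((1, 8))
x[0, 0] = 1000.0  # planted "massive activation"

pre = pre_norm_block(x, f)
post = post_norm_block(x, f)
print(abs(pre[0, 0]) > 100)   # True: the outlier survives the pre-norm block
print(abs(post[0, 0]) < 10)   # True: post-norm squashes it
```

Under this toy dynamic, only the pre-norm configuration lets a near-constant extreme value ride the residual stream from layer to layer, which is the mechanism the summary attributes to the co-occurrence of the two phenomena.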
Transformer language models • massive activations • attention sinks • hidden representations • pre-norm configuration • attention heads • layer normalization • token embeddings • architectural artifact • short-range dependencies
Authors
Shangwen Sun, Alfredo Canziani, Yann LeCun, Jiachen Zhu
Abstract
We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance. Prior work observes that these phenomena frequently co-occur and often involve the same tokens, but their functional roles and causal relationship remain unclear. Through systematic experiments, we show that the co-occurrence is largely an architectural artifact of modern Transformer design, and that the two phenomena serve related but distinct functions. Massive activations operate globally: they induce near-constant hidden representations that persist across layers, effectively functioning as implicit parameters of the model. Attention sinks operate locally: they modulate attention outputs across heads and bias individual heads toward short-range dependencies. We identify the pre-norm configuration as the key choice that enables the co-occurrence, and show that ablating it causes the two phenomena to decouple.
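The two phenomena defined in the abstract can each be operationalized with a simple detector. The sketch below flags massive activations as entries whose magnitude far exceeds the median absolute activation, and attention sinks as key positions that absorb a disproportionate share of attention mass averaged over queries. Both criteria (the `ratio` and `threshold` values, and the function names) are hypothetical illustrations, not the paper's definitions.

```python
import numpy as np

def find_massive_activations(hidden, ratio=100.0):
    """Return (token, channel) indices whose magnitude exceeds `ratio`
    times the median absolute activation (an assumed outlier criterion)."""
    mags = np.abs(hidden)
    return np.argwhere(mags > ratio * np.median(mags))

def find_attention_sinks(attn, threshold=0.5):
    """Return key positions receiving, averaged over queries, more than
    `threshold` of the attention mass (an assumed sink criterion)."""
    mean_mass = attn.mean(axis=0)  # average over query positions
    return np.flatnonzero(mean_mass > threshold)

# Toy data: 6 tokens, 16 channels, one planted outlier; an attention map
# in which every query dumps most of its mass on token 0.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 16))
hidden[0, 3] = 500.0                 # planted massive activation
attn = np.full((6, 6), 0.3 / 5)
attn[:, 0] = 0.7                     # rows still sum to 1

print(find_massive_activations(hidden))  # the planted (0, 3) entry
print(find_attention_sinks(attn))        # token 0 acts as the sink
```

On real models one would apply such detectors to hidden states and attention maps layer by layer; the abstract's finding is that under pre-norm the tokens flagged by the two detectors largely coincide, and that ablating pre-norm decouples them.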