When Do Attention Circuits Form? Developmental Trajectories of Capability and Attention-Sink Emergence Across Three 1B-Class Architectures
2026-06-01 • Machine Learning
Machine Learning, Artificial Intelligence
AI summary
The authors track how attention heads form during training in three 1-billion-parameter-class language models. They find that layers 0 and 1 never produce BOS-attractor heads at any checkpoint in any model, and that the overall fraction of BOS-attractor heads grows differently across models: gradually in Pythia 1B and OLMoE 1B-7B, and abruptly in OLMo 1B. In the DCLM-trained models, induction circuits form 10-20x earlier in tokens than BOS-attractor circuits, so capability-circuit formation and attention-sink formation are separate transitions. They also show that the final induction circuit can be identified within the first 0.3-2% of training tokens, without needing the fully trained model.
attention head, transformer, language model, BOS-attractor, induction circuit, capability selectivity, mixture-of-experts, training dynamics, mechanistic interpretability, participation ratio
Authors
Yongzhong Xu
Abstract
We track the developmental trajectory of attention-head circuit formation across three 1B-class language models spanning two architecture families (dense transformer, mixture-of-experts) and two pretraining corpora (The Pile, DCLM): Pythia 1B, OLMo 1B-0724-hf, and OLMoE 1B-7B-0924. At each of 10 log-spaced revisions per model -- 30 mechanistic-interpretability runs in total -- we apply a participation-ratio (PR) spectral signal and an all-head capability-specific selectivity screen to track induction, previous-token, and BOS-attractor heads as they emerge. Five findings. (F1) Layers 0 and 1 produce zero BOS-classified heads at every revision in every model: the L0/L1 zero-BOS floor is an architectural property, not a learned outcome. (F2) The whole-model BOS-attractor fraction follows three distinct emergence shapes -- a gradual ramp in Pythia 1B, a sharp phase transition in OLMo 1B (7% to 70% between adjacent checkpoints), and a gradual ramp in OLMoE 1B-7B. (F3) In DCLM models, induction-circuit formation precedes BOS-attractor formation by 10-20x in tokens; capability-circuit formation and attention-sink formation are two transitions, not one. (F4) The capability-specific screen converges to the final induction circuit within 0.3-2% of total training tokens -- circuit identification does not require the final model. (F5) For every final-checkpoint induction head sampled across all three models, per-head PR is elevated at or before the first revision at which that head crosses its capability-selectivity threshold. The results refine the induction-phase-transition framing: in 1B-class models trained on DCLM, the induction transition and the attention-sink transition are separated by an order of magnitude in tokens and have qualitatively different shapes.
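As a rough illustration of two quantities named in the abstract, the sketch below computes a participation ratio and a BOS attention mass in plain Python. It rests on assumptions: the abstract does not say which spectrum the PR signal is computed over or how a head is classified as a BOS-attractor, so the sketch uses the standard definition PR = (sum of λ_i)^2 / (sum of λ_i^2) over any nonnegative spectrum, and a hypothetical threshold of 0.5 on mean attention to position 0. The function names and the threshold are illustrative only, not the author's.

```python
from typing import Sequence


def participation_ratio(spectrum: Sequence[float]) -> float:
    """Standard participation ratio of a nonnegative spectrum.

    PR = (sum_i x_i)^2 / sum_i x_i^2. It equals n for a flat spectrum of
    n equal values and 1 when a single component carries everything.
    Which spectrum the paper uses is not stated in the abstract.
    """
    total = sum(spectrum)
    sum_sq = sum(x * x for x in spectrum)
    if sum_sq == 0:
        return 0.0  # empty or all-zero spectrum: no effective dimensions
    return (total * total) / sum_sq


def bos_attention_mass(attn: Sequence[Sequence[float]]) -> float:
    """Mean attention weight placed on position 0 (the BOS token).

    attn[q][k] is the weight from query position q to key position k.
    Query 0 is skipped: under causal masking it can only attend to itself.
    """
    rows = attn[1:]
    if not rows:
        return 0.0
    return sum(row[0] for row in rows) / len(rows)


def is_bos_attractor(attn: Sequence[Sequence[float]], threshold: float = 0.5) -> bool:
    """Hypothetical classifier: the 0.5 threshold is an assumption."""
    return bos_attention_mass(attn) >= threshold


# Toy causal attention pattern for a 3-token sequence (rows sum to 1).
toy_attn = [
    [1.0],
    [0.9, 0.1],
    [0.8, 0.1, 0.1],
]
flat_pr = participation_ratio([1.0, 1.0, 1.0, 1.0])
peaked_pr = participation_ratio([1.0, 0.0, 0.0, 0.0])
toy_mass = bos_attention_mass(toy_attn)
```

Under these assumptions, a head whose attention pattern resembles `toy_attn` would count as a BOS-attractor, and a PR that rises from near 1 toward larger values means the spectrum is spreading over more components.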