Positional versus Symbolic Attention Heads: Learning Dynamics, RoPE Geometry, and Length Generalization

2026-05-29

Machine Learning, Artificial Intelligence
AI summary

The authors studied how a Transformer-based language model learns to solve tasks that call for different kinds of reasoning. They trained the model on two structurally equivalent tasks: one on numbers, which requires reasoning about positions, and one on letters, which requires reasoning about symbols. They found that successful learning is associated with attention heads that specialize in either positional or symbolic behavior. The number task needs both kinds of heads, while the letter task needs only symbolic ones. The authors also showed that symbolic mechanisms extend to longer input sequences more reliably than positional mechanisms, which face sharper limitations as sequences grow.

Transformer, Attention heads, Multi-hop reasoning, Positional reasoning, Symbolic reasoning, RoPE, Query-key-value, Extrapolation, Sequence length, Decoder-only Transformer
Authors
Felipe Urrutia, Juan José Alegría, Cinthia Sanchez Macias, Jorge Salas, Cristian B. Calderon, Cristobal Rojas
Abstract
Transformer-based language models are widespread in today's society. As such, understanding the mechanisms by which they solve structured tasks and predicting how they may behave in novel scenarios is of great importance for safe deployment. We study the learning dynamics of attention heads in a controlled setting by training a decoder-only Transformer (GPT-J) on two structurally equivalent multi-hop reasoning tasks: a number task requiring positional reasoning and a letter task requiring symbolic reasoning. Using a recently introduced metric that classifies attention-head behavior as positional or symbolic for a given prompt, we show that successful learning is associated with the emergence of pure heads, i.e., heads that express themselves as either positional or symbolic. Despite the tasks' structural equivalence, they impose different mechanistic demands: the number task requires both positional and symbolic heads, whereas the letter task requires only symbolic heads. We then identify the computational roles of these heads, characterize the basic functions they implement, and give theoretical constructions showing how single-layer RoPE-based attention can realize these functions through geometrically interpretable query, key, and value operations. This analysis yields a quantitative separation between positional and symbolic mechanisms in their robustness to longer sequences, formalized through a novel notion of discrepancy. We empirically validate the resulting predictions in both controlled and real-world models, showing that symbolic mechanisms extrapolate more reliably to longer sequences while positional mechanisms face sharper limitations.
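
As background for the RoPE-based constructions mentioned in the abstract, the following is a minimal sketch of standard rotary position embeddings, not the authors' construction. It illustrates the one geometric fact those constructions build on: after rotation, the query-key logit depends only on the relative offset between positions. The names `rope` and `score`, the dimension 8, and the random vectors are illustrative assumptions; `base=10000.0` is the commonly used default.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive 2-D pairs of x by angles pos * theta_i (standard RoPE)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # one frequency per 2-D pair
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

def score(q, k, m, n):
    """Attention logit between a query at position m and a key at position n."""
    return float(rope(q, m) @ rope(k, n))

# Illustrative random query and key vectors (assumption, not data from the paper).
rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# The same offset (m - n = 3) at different absolute positions gives the same logit.
near = score(q, k, 5, 2)
far = score(q, k, 105, 102)
print(np.isclose(near, far))
```

Because the logit is a function of the offset alone, a head can in principle select "the token a fixed number of steps back" independently of absolute position. How robust such a selection remains at longer sequence lengths is the question the abstract's notion of discrepancy addresses.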