From Data Statistics to Feature Geometry: How Correlations Shape Superposition

2026-03-10

Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition
AI summary

The authors study how neural networks store many features by overlapping them, a concept called superposition. They argue that previous explanations assumed features were independent and sparse, which misses some important effects seen with real data like language. By creating a new test case called BOWS, they show that when features are related, overlaps can help build useful feature groups rather than just causing noise. Their work helps explain certain patterns found in language models that older theories did not cover.

mechanistic interpretability, superposition, neural networks, feature correlation, sparse autoencoders, ReLU, weight decay, semantic clusters, bag-of-words, interference
Authors
Lucas Prieto, Edward Stevinson, Melih Barsbey, Tolga Birdal, Pedro A. M. Mediano
Abstract
A central idea in mechanistic interpretability is that neural networks represent more features than they have dimensions, arranging them in superposition to form an over-complete basis. This framing has been influential, motivating dictionary learning approaches such as sparse autoencoders. However, superposition has mostly been studied in idealized settings where features are sparse and uncorrelated. In these settings, superposition is typically understood as introducing interference that must be minimized geometrically and filtered out by non-linearities such as ReLUs, yielding local structures like regular polytopes. We show that this account is incomplete for realistic data by introducing Bag-of-Words Superposition (BOWS), a controlled setting to encode binary bag-of-words representations of internet text in superposition. Using BOWS, we find that when features are correlated, interference can be constructive rather than just noise to be filtered out. This is achieved by arranging features according to their co-activation patterns, making interference between active features constructive, while still using ReLUs to avoid false positives. We show that this kind of arrangement is more prevalent in models trained with weight decay and naturally gives rise to semantic clusters and cyclical structures which have been observed in real language models yet were not explained by the standard picture of superposition. Code for this paper can be found at https://github.com/LucasPrietoAl/correlations-feature-geometry.
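The abstract's "standard picture" of superposition can be illustrated with a minimal toy construction (this sketch is ours, not the paper's BOWS setup): embed more sparse binary features than dimensions by assigning them the vertices of a regular polygon, then read them back with a ReLU plus a negative bias that filters out the interference between non-orthogonal directions. All names and parameter values below are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy model of superposition (not the paper's exact code):
# n_feat sparse binary features stored in d_model < n_feat dimensions,
# with feature directions placed at the vertices of a regular pentagon.
n_feat, d_model = 5, 2
angles = 2 * np.pi * np.arange(n_feat) / n_feat
W = np.stack([np.cos(angles), np.sin(angles)])  # shape (d_model, n_feat)

b = -0.4  # negative bias: lets ReLU zero out small interference terms

def reconstruct(x):
    h = W @ x                            # compress features into d_model dims
    return np.maximum(W.T @ h + b, 0.0)  # ReLU readout filters interference

# With one active feature, the off-diagonal interference terms are
# cos(2*pi/5) ~ 0.31 and cos(4*pi/5) ~ -0.81; both fall below zero after
# the bias, so only the truly active feature survives the ReLU.
for i in range(n_feat):
    x = np.zeros(n_feat)
    x[i] = 1.0
    r = reconstruct(x)
    assert r.argmax() == i and np.count_nonzero(r) == 1
```

In this idealized uncorrelated-feature regime, interference is pure noise to be geometrically minimized and clipped away, which is exactly the account the paper argues becomes incomplete once features co-activate in correlated patterns, as in real text.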