How Optimality Structures Sparse Dictionaries: A Theory for Understanding SAE Representations

2026-06-01 · Machine Learning

AI summary

The authors study Sparse Autoencoders (SAEs), which are tools that break down complex neural data into simpler, understandable parts called concepts. They point out that while SAEs work well in practice, it's not clear what features these models are actually capturing. Instead of relying on simple data assumptions, the authors analyze the mathematical conditions a perfect SAE solution must meet. Using this, they explain various behaviors seen in SAEs and introduce a new approach to better understand and improve these models in the future.

Sparse Autoencoder, Sparse Coding, Dictionary Learning, Local Optimality, Nonnegativity, L1 Regularization, Hierarchical Splitting, Convex Optimization, Neural Representations, Joint Optimization
Authors
William Dorrell
Abstract
Sparse Autoencoders (SAEs) have found success parsing neural representations into interpretable concepts, providing a basis for understanding and control. However, what exactly SAEs extract, and, correspondingly, the scientific conclusions we can draw from them, are not obvious. Empirically, the proof is in the pudding: SAEs learn interpretable features. Theoretically, we lack a clear account of what properties a 'concept' must satisfy for an SAE to extract it. There has been extensive identifiability work studying the conditions under which sparse coding recovers ground-truth features; however, these approaches tend to focus on simple data-generating models (e.g. sparse independent features) which poorly approximate the internet-swallowing language-model representations on which SAEs are trained. Here, avoiding data-generating models, we ask simply what properties any dictionary learning optimum must satisfy. Concretely, we extend local optimality analyses (Gribonval & Schnass, 2010) to the nonnegative joint-optimisation problem that vanilla SAEs approximate, and derive constraints relating optimal SAE features to their distributions. We use these constraints to explain a range of observed SAE behaviours - hierarchical splitting & absorption, the structure of residuals, and dense antipodal features - each reflecting how L1+nonnegativity interact with data to structure optimal dictionaries. Finally, we construct a novel large-dictionary convex problem and explore the wide atom-per-datapoint limit. In sum, we hope to tease model assumptions from unexpected observations, letting us learn more from SAEs' successes and provide principles for designing their successors.
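
To make the "L1+nonnegativity" setting concrete, here is a minimal sketch of nonnegative sparse coding for a single datapoint with a fixed dictionary. It is an illustration under stated assumptions, not the paper's formulation: the objective 0.5*||x - D z||^2 + lam*sum(z) with z >= 0, the function names `nonneg_sparse_code` and `objective`, the value of `lam`, and the toy data are all hypothetical. The joint problem described in the abstract would also optimise the dictionary `D`; this sketch holds `D` fixed and only solves for the codes, using projected proximal gradient descent (ISTA).

```python
import numpy as np

def objective(x, D, z, lam=0.1):
    # Assumed objective: squared reconstruction error plus an L1 penalty on the codes.
    return 0.5 * np.sum((x - D @ z) ** 2) + lam * np.sum(np.abs(z))

def nonneg_sparse_code(x, D, lam=0.1, n_steps=500):
    """Approximately minimise objective(x, D, z) subject to z >= 0, with D fixed."""
    # Step size 1/L, where L is the largest eigenvalue of D^T D
    # (the Lipschitz constant of the gradient of the reconstruction term).
    L = np.linalg.norm(D, 2) ** 2
    z = np.zeros(D.shape[1])
    for _ in range(n_steps):
        grad = D.T @ (D @ z - x)
        # Proximal step for lam*sum(z) restricted to z >= 0:
        # shift by lam/L, then clip at zero.
        z = np.maximum(z - (grad + lam) / L, 0.0)
    return z

# Toy data: a unit-norm random dictionary and a datapoint built from two atoms.
rng = np.random.default_rng(0)
D = rng.normal(size=(8, 16))
D /= np.linalg.norm(D, axis=0, keepdims=True)
z_true = np.zeros(16)
z_true[[2, 7]] = [1.0, 0.5]
x = D @ z_true

z_hat = nonneg_sparse_code(x, D)
```

The clip at zero is where nonnegativity enters, and the constant shift by `lam / L` is where the L1 penalty enters; together they push small or negative coefficients exactly to zero.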