Ablating Archetypes: The Stability of Archetypal SAEs is an Artifact of Initialization and Metric Design

2026-06-01

Machine Learning
AI summary

The authors investigate archetypal sparse autoencoders (SAEs), a method claimed to produce more stable features than regular SAEs. They find that the reported stability came mainly from using the same initialization in every run, not from the method being inherently more stable. They distinguish stability (how similar two separately trained models are) from stabilization (whether differently initialized runs end up at the same solution). Their results suggest that claims about stable features need to be tested with different initializations and by tracking how models change during training.

Sparse Autoencoders, Dictionary Learning, Overcomplete Bases, Polysemanticity, Initialization, Feature Stability, Mechanistic Interpretability, k-means Clustering, Cosine Similarity, Trajectory Diagnostics
Authors
Michał Brzozowski, Neo Christopher Chung
Abstract
Dictionary learning with sparse autoencoders (SAEs) produces overcomplete bases from neural network activations that are often interpretable and reduce polysemanticity. However, SAE features vary substantially across random seeds, a problem known as instability. Archetypal SAEs (Fel et al., 2025) were proposed as a general dictionary-learning intervention for more reliable concept extraction and were reported to yield more stable dictionaries at the end of training. We demonstrate that the stability claimed for archetypal SAEs results from using an identical initialization across multiple runs. Through our analyses, we attempt to clarify two distinct notions in mechanistic interpretability that may be used ambiguously: stability is agreement between two independently trained models, whereas stabilization is the convergence of independently initialized runs toward a common solution. This distinction is critical for mechanistic interpretability in natural language processing (NLP), where feature stability is increasingly used as evidence that SAE features are reusable units of analysis. The archetypal SAE experiments share a deterministic k-means decoder initialization, which sets the inter-run dictionary distance to zero before training begins. When this shared initialization is removed, the archetypal constraint provides no stabilization advantage in our setting. We further identify a preprocessing-dependent cosine geometry issue that complicates the interpretation of endpoint stability metrics. Overall, our study supports the value of studying SAEs within the larger dictionary-learning tradition while showing that stability claims require trajectory diagnostics and initialization ablations.
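The initialization effect described in the abstract can be illustrated with a toy sketch. It is not the authors' code: the agreement metric (mean best-match cosine similarity), the small k-means routine, the synthetic data, and all function names are assumptions made for illustration only. The sketch trains nothing. It only shows that two runs sharing a deterministic k-means initialization start out with identical dictionaries, while two independently sampled dictionaries do not.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def mean_max_cosine(dict_a, dict_b):
    """Illustrative agreement score (an assumption, not the paper's metric):
    for each atom in dict_a, take its best cosine match in dict_b, then average."""
    a = [normalize(v) for v in dict_a]
    b = [normalize(v) for v in dict_b]
    best = [max(sum(p * q for p, q in zip(u, v)) for v in b) for u in a]
    return sum(best) / len(best)


def random_vectors(n, dim, seed):
    """Draw n Gaussian vectors of dimension dim from a seeded generator."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


def kmeans_init(data, k, seed=0, iters=10):
    """Plain Lloyd's k-means with a fixed seed, so repeated calls on the
    same data return exactly the same centroids (a deterministic init)."""
    rng = random.Random(seed)
    centroids = [list(x) for x in rng.sample(data, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            # Assign each point to its nearest centroid (squared distance).
            j = min(
                range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(x, centroids[c])),
            )
            clusters[j].append(x)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


# Synthetic stand-in for network activations (hypothetical data).
data = random_vectors(200, 8, seed=123)

# Two "runs" that share the same deterministic k-means initialization.
run1 = kmeans_init(data, k=16, seed=0)
run2 = kmeans_init(data, k=16, seed=0)
shared = mean_max_cosine(run1, run2)

# Two "runs" with independently sampled initial dictionaries.
rand1 = random_vectors(16, 8, seed=1)
rand2 = random_vectors(16, 8, seed=2)
independent = mean_max_cosine(rand1, rand2)

print(f"shared k-means init:  {shared:.3f}")
print(f"independent init:     {independent:.3f}")
```

Under these assumptions the shared-initialization score is 1.0 up to floating-point error before any training step, so a high endpoint score alone cannot separate an effect of the method from an effect of the starting point. An initialization ablation, in this toy framing, amounts to comparing runs that start from the second kind of pair.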