Do Sparse Autoencoders Learn Meaningful Concept Hierarchies?

2026-06-22

Machine Learning
AI summary

The authors examine how features learned by sparse autoencoders (SAEs) can be organized into meaningful hierarchies, which would make large models easier to interpret. Because there was no agreed-upon standard for what makes a good hierarchy, they derive a set of requirements and an evaluation protocol based on them. Applying the protocol to existing SAE approaches trained on visual data, they find that the feature spaces generally provide a basis for sensible hierarchies, but that establishing good hierarchical structure remains difficult. They also find that feature absorption, in both its well-known hard form and a continuous, soft form, systematically compromises hierarchy quality, a tension that future approaches will need to address.

sparse autoencoders, unsupervised learning, feature hierarchy, concept discovery, semantic networks, taxonomy, feature absorption, visual data, hierarchical structure, evaluation protocol
Authors
Nils Grandien, David Steinmann, Felix Friedrich, Kristian Kersting
Abstract
Sparse autoencoders (SAEs) have become an important tool for unsupervised concept discovery in large models. To make the resulting feature spaces more interpretable and manageable, recent approaches have begun imposing hierarchical structure, either explicitly or as an implicit effect of training constraints, yet rigorous comparison remains difficult. There are no agreed-upon requirements for what a meaningful feature hierarchy should satisfy, and evaluation has largely relied on qualitative illustrations with fragmented quantitative protocols. To address this, we derive a set of key requirements for generalization/specialization hierarchies in unsupervised concept discovery, drawing on semantic net and taxonomy research alongside recent SAE work, and use them to derive a concrete evaluation protocol. Applying this protocol to current SAE approaches trained on visual data, we find that while feature spaces generally provide a basis for sensible hierarchies, establishing good hierarchical structure remains challenging. In particular, feature absorption, both in its well-known hard form and in a continuous, soft form, systematically compromises hierarchy quality, pointing to a fundamental tension that future approaches will need to navigate.