A Unifying Framework for Concept-Based Representational Similarity

2026-06-08

Machine Learning
AI summary

The authors study how different learned models represent concepts in similar ways, but find that the term 'concept alignment' is poorly defined and used inconsistently. They create a clear framework breaking down alignment by what is aligned (representations or concepts) and how (per example or over distributions), defining four key properties. They develop a new benchmark, InterVenchA, to measure these properties separately and show that existing methods do not reliably achieve all types of alignment, especially at the example level without supervision. They propose a new model, CoSAE, that combines objectives to achieve strong alignment, and find that even very little paired data helps greatly. Overall, the authors emphasize that concept alignment needs multiple objectives to be properly defined, measured, and optimized.

concept alignment, representation learning, instance-wise alignment, distributional alignment, translation consistency, concept consistency, autoencoder, unsupervised learning, benchmark, paired data
Authors
Grégoire Dhimoïla, Victor Boutin, Agustin Martin Picard, Thomas Fel, Thomas Serre
Abstract
Learned representations across models and modalities often exhibit striking structural similarities, suggesting shared underlying concept decompositions. However, concept alignment remains poorly defined: existing approaches optimize different objectives under the same terminology, obscuring what is actually aligned. We propose a unifying framework that decomposes alignment along two axes: what is aligned (representations vs. concepts) and at what level (instance-wise vs. distributional). This induces four corresponding properties (instance-wise and distributional variants of translation and concept consistency) and reveals precisely which of these guarantees existing methods provide. We further introduce InterVenchA, an intervention-based benchmark that separately measures extraction quality, translation quality, and concept consistency. Through theory and experiments, we show that commonly assumed equivalences between alignment objectives fail in practice: optimizing one property does not reliably recover the others, and purely unsupervised objectives fail to recover meaningful instance-level alignment. We then propose the Coupled Sparse Autoencoder (CoSAE), which jointly enforces complementary alignment objectives. Strong alignment emerges only in this regime. Surprisingly, as little as 0.1% paired data is sufficient to recover instance-level alignment when anchoring distributional objectives. Overall, our results show that concept alignment is fundamentally multi-objective: it must be defined, measured, and optimized as such.
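
The distinction between distributional and instance-wise alignment can be illustrated with a toy sketch. This is not the paper's benchmark or any of its metrics: the function name, the 1-D Gaussian "representations", and the sign-flipping translator are all assumptions made purely for illustration. The point it shows is a general one: a translator can match the target distribution almost perfectly while mapping nearly every individual example to the wrong place.

```python
import random


def toy_alignment_gap(n=10_000, seed=0):
    """Hypothetical toy example: distributional match without instance match."""
    rng = random.Random(seed)

    # Source "representations": 1-D samples from a symmetric distribution.
    source = [rng.gauss(0.0, 1.0) for _ in range(n)]

    # Paired targets: in this toy setup the correct translation is the identity.
    target = list(source)

    # An assumed, deliberately wrong translator: flip the sign of every sample.
    # Because the distribution is symmetric, the output distribution is unchanged.
    translated = [-x for x in source]

    # Distributional gap: mean absolute difference between sorted samples,
    # which equals the empirical 1-D Wasserstein-1 distance.
    dist_gap = sum(
        abs(a - b) for a, b in zip(sorted(translated), sorted(target))
    ) / n

    # Instance-wise gap: mean absolute error on the true pairs.
    inst_gap = sum(abs(a - b) for a, b in zip(translated, target)) / n

    return dist_gap, inst_gap


if __name__ == "__main__":
    dist_gap, inst_gap = toy_alignment_gap()
    print(f"distributional gap: {dist_gap:.3f}")
    print(f"instance-wise gap:  {inst_gap:.3f}")
```

In this sketch the distributional gap is close to zero while the instance-wise gap is large. A handful of known pairs would be enough to reject the sign-flipped translator, which is the intuition behind anchoring distributional objectives with a small amount of paired data.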