On the Faithfulness of Post-Hoc Concept Bottleneck Models

2026-06-29 | Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors study models that explain their decisions through human-understandable concepts, such as identifying a bird by its belly color. They show that task accuracy alone does not guarantee that these concepts are meaningful, because even random concepts can predict well. They identify two main problems: concepts learned from auxiliary data may not transfer to the target task, and concept labels generated by vision-language models can be systematically noisy. To address this, they introduce metrics that check whether concepts truly represent what they should, independently of accuracy. Their tests on real-world and synthetic benchmarks show that these metrics catch issues that accuracy misses.

Post-Hoc Concept Bottleneck Models, latent features, concept projections, covariate shift, label noise, vision-language models, predictive accuracy, concept faithfulness
Authors
Laines Schmalwasser, Jan Blunk, Niklas Penzel, Julia Niebling, Joachim Denzler
Abstract
Human decision-making interprets the world through high-level concepts, such as recognizing a bird by its belly color. To bridge the gap between opaque deep learning representations and human understanding, Post-Hoc Concept Bottleneck Models (post-hoc CBMs) project latent features onto interpretable concept spaces using auxiliary datasets or vision-language models. However, relying on target task accuracy as the primary measure of post-hoc CBM success obscures whether the learned concepts are semantically meaningful or merely predictive artifacts. For example, random concept projections can achieve competitive accuracy despite being semantically meaningless. In this work, we analyze the learned projections directly and identify two failure cases: First, for concept projections learned from auxiliary data, covariate shifts can lead to unfaithful concept representations for the target task. In particular, we provide an upper bound on the error introduced by this shift. Second, systematic label noise in surrogate concept labels generated by vision-language models leads to unfaithful projections. After formalizing these failure modes, we introduce novel metrics that decouple concept faithfulness from predictive accuracy. Our empirical results across real-world and synthetic benchmarks confirm that these metrics identify unfaithful behaviors that standard accuracy-based evaluation fails to detect.
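The claim that random concept projections can reach competitive accuracy can be illustrated with a minimal sketch. This is not the authors' experimental setup: the Gaussian-cluster "latent features", the least-squares classifier, and all names and parameters (`compare_random_concepts`, `n_concepts`, and so on) are assumptions chosen only for illustration. The idea is that a linear classifier trained on projections onto random directions can still separate the classes, even though those directions carry no semantic meaning.

```python
import numpy as np


def fit_and_score(x_train, y_train, x_test, y_test, n_classes):
    """Fit a least-squares linear classifier on one-hot targets and return test accuracy."""
    # Append a constant column so the classifier has a bias term.
    a_train = np.hstack([x_train, np.ones((x_train.shape[0], 1))])
    a_test = np.hstack([x_test, np.ones((x_test.shape[0], 1))])
    onehot = np.eye(n_classes)[y_train]
    coef, *_ = np.linalg.lstsq(a_train, onehot, rcond=None)
    pred = (a_test @ coef).argmax(axis=1)
    return float((pred == y_test).mean())


def compare_random_concepts(n_per_class=200, dim=64, n_classes=5,
                            n_concepts=32, seed=0):
    """Return (accuracy on raw features, accuracy on random 'concept' activations)."""
    rng = np.random.default_rng(seed)

    # Assumption: synthetic latent features, one Gaussian cluster per class,
    # standing in for the output of a frozen backbone.
    means = rng.normal(scale=3.0, size=(n_classes, dim))
    labels = np.repeat(np.arange(n_classes), n_per_class)
    feats = means[labels] + rng.normal(size=(labels.size, dim))

    # Shuffle, then split in half into train and test sets.
    perm = rng.permutation(labels.size)
    feats, labels = feats[perm], labels[perm]
    split = labels.size // 2

    # Random "concept" directions: semantically meaningless by construction.
    directions = rng.normal(size=(dim, n_concepts))
    concepts = feats @ directions

    raw_acc = fit_and_score(feats[:split], labels[:split],
                            feats[split:], labels[split:], n_classes)
    random_acc = fit_and_score(concepts[:split], labels[:split],
                               concepts[split:], labels[split:], n_classes)
    return raw_acc, random_acc


if __name__ == "__main__":
    raw_acc, random_acc = compare_random_concepts()
    print(f"raw features: {raw_acc:.3f}, random concepts: {random_acc:.3f}")
```

In this toy setting the bottleneck built from random directions retains enough class information for the downstream classifier, which is why accuracy by itself says little about whether the concept layer is faithful.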