Assessing Reliability of Symbol Detection in Concept Bottleneck Models

2026-06-15, Machine Learning

Machine Learning, Computer Vision and Pattern Recognition, Symbolic Computation
AI summary

The authors investigate Concept Bottleneck Models (CBMs), which explain AI decisions using understandable symbols called concepts. They show that even if a CBM performs well, it might rely on incorrect or unreliable concepts, making its explanations untrustworthy. By swapping independently trained concept detectors and classification heads between models and measuring how much accuracy is lost, the authors identify which concepts are unreliable. They then propose a new training method that encourages the model to depend less on these unreliable concepts, improving the reliability of its explanations. Their experiments show that this approach reduces errors caused by misleading concepts.

Concept Bottleneck Models, Explainable AI, Concept detection, Model reliability, Symbolic vocabulary, Training strategies, Swap accuracy, Task accuracy, Concept supervision, Spurious correlations
Authors
Javier Fumanal-Idocin, Javier Andreu-Perez
Abstract
Concept Bottleneck Models (CBMs) are a relevant tool for explainable Artificial Intelligence because they make their predictions through human-interpretable symbols. However, high task accuracy does not guarantee that these symbols are detected faithfully: jointly trained CBMs may encode task-specific shortcuts in the bottleneck, making their explanations unreliable. In this paper, we study concept-detection reliability by swapping independently trained concept detectors and classification heads that share the same symbolic vocabulary. We use the resulting performance degradation, concept-level metrics, and symbol-wise uncertainty estimates to identify concepts that are especially prone to spurious firing. Finally, we propose a reliability-aware training strategy in which a shared concept detector is optimized with multiple classification heads and penalized for relying on globally or instance-wise unreliable symbols. On CUB-200-2011 with full concept supervision, detectors and heads are almost freely interchangeable (swap drop below one accuracy point, relative retention above $99\%$, and no concept detected below chance), whereas on a controlled synthetic task we show that, as the concept-supervision weight is reduced, models keep near-perfect task accuracy while swapped accuracy and agreement with the ground-truth concepts collapse to chance. Our reliability-aware training substantially mitigates this leakage, roughly doubling swap accuracy in the leaky regime.
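The swap test described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`swap_report`, `faithful_detector`, `leaky_detector`, and so on), the toy XOR task, and the exact metric definitions are assumptions made for illustration. Here "swap drop" is taken to be native accuracy minus swapped accuracy, and "relative retention" to be swapped accuracy divided by native accuracy; the paper may define them differently.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of matching labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def swap_report(detectors, heads, x, y):
    """Evaluate every detector with every head.

    detectors[i] and heads[i] are assumed to come from the same trained CBM,
    and all models are assumed to share one concept vocabulary.
    """
    n = len(detectors)
    acc = np.zeros((n, n))
    for i, detect in enumerate(detectors):
        concepts = detect(x)
        for j, head in enumerate(heads):
            acc[i, j] = accuracy(y, head(concepts))
    native = float(np.mean(np.diag(acc)))                 # matched pairs
    swapped = float(np.mean(acc[~np.eye(n, dtype=bool)]))  # mismatched pairs
    return {
        "native_acc": native,
        "swap_acc": swapped,
        "swap_drop": native - swapped,  # assumed definition
        "relative_retention": swapped / native if native > 0 else float("nan"),
    }


# Toy task (hypothetical): inputs are two binary concepts, label is their XOR.
def faithful_detector(x):
    # Reports the true concepts.
    return x.copy()


def faithful_head(c):
    return c[:, 0] ^ c[:, 1]


def leaky_detector(x):
    # Ignores the concept semantics and hides the (inverted) label in slot 0.
    y = x[:, 0] ^ x[:, 1]
    return np.stack([1 - y, np.zeros_like(y)], axis=1)


def leaky_head(c):
    # Decodes the shortcut written by leaky_detector.
    return 1 - c[:, 0]


x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = x[:, 0] ^ x[:, 1]
report = swap_report(
    [faithful_detector, leaky_detector],
    [faithful_head, leaky_head],
    x,
    y,
)
print(report)
```

In this toy setting both matched pairs reach a native accuracy of 1.0, while the mismatched pairs average 0.25. A model can therefore look perfect on the task while its bottleneck carries a private code instead of the shared concepts, which is the failure mode the swap test is meant to expose.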