E-TCAV: Formalizing Penultimate Proxies for Efficient Concept Based Interpretability

2026-05-11

Artificial Intelligence · Machine Learning
AI summary

The authors address shortcomings of TCAV, a method that explains what neural networks have learned by linking internal features to human-understandable concepts, but which is computationally expensive and statistically unstable. They study how the choice of layer and of latent classifier affects the results, finding that the network's final layers agree strongly on TCAV scores and that most of the observed variance comes from the classifier choice. Building on these insights, they introduce E-TCAV, which approximates TCAV scores by using the penultimate layer as a proxy for earlier layers, making computation much faster with little loss in fidelity. Experiments across multiple architectures and datasets from vision and language show that E-TCAV can substantially accelerate interpretability workflows, supporting model debugging and concept-guided training. This work is a practical step toward more efficient and stable neural network explanations.

TCAV, Concept Activation Vectors, Neural Network Interpretability, Latent Classifier, Penultimate Layer, Inter-layer Agreement, Model Debugging, Natural Language Processing, Computer Vision, Real-time Training
Authors
Hasib Aslam, Muhammad Ali Chattha, Muhammad Taha Mukhtar, Muhammad Imran Malik, Andreas Dengel, Sheraz Ahmed
Abstract
TCAV (Testing with Concept Activation Vectors) is an interpretability method that assesses the alignment between the internal representations of a trained neural network and human-understandable, high-level concepts. Though effective, TCAV suffers from significant computational overhead, inter-layer disagreement of TCAV scores, and statistical instability. This work takes a step toward addressing these challenges by introducing E-TCAV, a framework for efficient approximation of TCAV scores, which is based on extensive investigation into three key aspects of the TCAV methodology: 1) the effect of latent classifiers on the stability of TCAV scores, 2) the inter-layer agreement of TCAV scores, and 3) the use of the penultimate layer as a fast proxy for earlier layers for TCAV computation. To ensure a solid foundation for E-TCAV, we conduct extensive evaluations across four different architectures and five datasets, encompassing problems from both computer vision and natural language domains. Our results show that the layers in the final block of the neural network strongly agree with the penultimate layer in terms of the TCAV scores, and the commonly observed variance of the TCAV scores can be attributed to the choice of the latent classifier. Leveraging this inter-layer agreement and the degeneracy of directional sensitivities at the penultimate layer, E-TCAV guarantees linearly scaling speed-ups with respect to the network's size and the number of evaluation samples, marking a step towards efficient model debugging and real-time concept-guided training.
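To make the penultimate-layer idea concrete, here is a minimal, self-contained sketch of a TCAV-style score at the penultimate layer. It uses synthetic activations and a mean-difference CAV as a simple stand-in for the latent classifier the paper studies; the array shapes, the linear head `W`, and the class index are all illustrative assumptions, not the authors' implementation. It also illustrates the degeneracy the abstract mentions: with a linear classification head, the gradient of a class logit with respect to penultimate activations is the same weight row for every input, so all samples share one directional derivative and the score collapses to 0 or 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations at the penultimate layer (dim d) for a
# concept set and a random counterexample set -- stand-ins for real
# model activations.
d, k = 8, 3                                        # feature dim, classes
concept_acts = rng.normal(0.5, 1.0, size=(50, d))
random_acts = rng.normal(0.0, 1.0, size=(50, d))

# Mean-difference CAV: the unit vector pointing from the random set
# toward the concept set (a simple proxy for the normal of a trained
# linear latent classifier).
cav = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
cav /= np.linalg.norm(cav)

# Illustrative linear classification head: logit_k(a) = W[k] @ a + b[k].
W = rng.normal(size=(k, d))

# TCAV score for a class: the fraction of inputs whose class logit
# increases when activations move along the CAV.  At the penultimate
# layer the gradient of logit_k w.r.t. the activations is just W[k],
# independent of the input, so every sample yields the same sign and
# the score degenerates to exactly 0.0 or 1.0.
grad = W[0]                    # identical for every sample
tcav_score = float(grad @ cav > 0)
print(tcav_score)
```

At earlier layers the gradient varies per input and the score is a genuine fraction in [0, 1]; the collapse shown here is what lets a single dot product at the penultimate layer stand in for a per-sample loop, which is the source of E-TCAV's speed-up.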