When Confidence Lacks Concepts: Interpretable OOD Detection via Representation Perturbations

2026-06-15

Machine Learning, Computer Vision and Pattern Recognition
AI summary

The authors address the problem that deep neural networks in medical imaging can overgeneralize to unfamiliar or unusual inputs, which is risky for clinical use. They propose a method that detects such out-of-distribution (OOD) cases by checking how stable the model's predictions remain when its internal representations are perturbed along concept directions tied to the predicted class. Sparse autoencoders decompose the dense intermediate representations into sparse, meaningful concepts learned from in-distribution data. If the class logits barely change under these perturbations, the input is likely in-distribution; if they change substantially, the input may be OOD. Because the perturbations are tied to identifiable concepts, the method also helps explain why the model is uncertain, which matters for medical safety.

Deep neural networks, Medical imaging, Out-of-distribution detection, Sparse autoencoders, Class-conditioned perturbations, Model uncertainty, Semantic representations, Logits stability, Interpretability, Distributional shift
Authors
Anju Chhetri, Pratik Shrestha, Ramesh Rana, Prashnna Gyawali, Binod Bhattarai
Abstract
Deep neural networks have achieved remarkable performance across medical imaging tasks, yet their tendency to overgeneralize under distributional shifts poses a major obstacle to safe clinical deployment. Out-of-Distribution (OOD) detection methods aim to mitigate this risk, but most existing approaches rely on opaque internal signals with poorly understood semantic meaning, limiting trust in safety-critical settings. In this work, we propose an interpretable OOD detection framework that probes the stability of model predictions under class-conditioned semantic perturbations. Leveraging sparse autoencoders (SAEs), we learn class-specific concept vectors from in-distribution data that disentangle dense intermediate representations into sparse, semantically meaningful components. At inference, we perturb deeper-layer representations using the concept vectors associated with the model's predicted class and measure the stability of the class logits. We hypothesize that in-distribution samples exhibit low sensitivity to such perturbations, as their representations align with class-specific semantic directions, whereas OOD samples show amplified deviations due to representational misalignment. By framing OOD detection as a concept-conditioned stability analysis, our approach provides both a discriminative OOD signal and an interpretable lens into the internal mechanisms driving model uncertainty, making it particularly suitable for high-stakes medical applications.
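The inference step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the toy two-layer `head`, the additive perturbation `h + eps * v`, the L2 norm used as the logit deviation, the step size `eps`, and the random unit vectors that stand in for SAE-derived class-specific concept vectors. The abstract does not specify any of these choices.

```python
import numpy as np

def head(h, params):
    """Toy two-layer classifier head (assumed): representation -> class logits."""
    W1, b1, W2, b2 = params
    return W2 @ np.maximum(W1 @ h + b1, 0.0) + b2

def concept_stability_score(h, concept_vectors, params, eps=0.5):
    """Mean logit deviation when h is shifted along the predicted class's concepts.

    h               : (d,) deeper-layer representation of one input
    concept_vectors : dict mapping class index -> (k, d) array of directions
                      (hypothetical stand-in for SAE-learned concept vectors)
    eps             : perturbation step size (assumed hyperparameter)
    Returns (score, predicted_class); a larger score means less stable logits.
    """
    logits = head(h, params)
    pred = int(np.argmax(logits))
    deltas = []
    for v in concept_vectors[pred]:
        shifted = head(h + eps * v, params)   # additive perturbation (assumed form)
        deltas.append(np.linalg.norm(shifted - logits))
    return float(np.mean(deltas)), pred

# Toy setup with random weights; no trained model or real data is involved.
rng = np.random.default_rng(0)
d, hidden, n_classes, k = 16, 32, 3, 4
params = (
    rng.normal(size=(hidden, d)),
    rng.normal(size=hidden),
    rng.normal(size=(n_classes, hidden)),
    rng.normal(size=n_classes),
)

# Random unit directions per class, used only as placeholders for concept vectors.
concept_vectors = {}
for c in range(n_classes):
    V = rng.normal(size=(k, d))
    concept_vectors[c] = V / np.linalg.norm(V, axis=1, keepdims=True)

h = rng.normal(size=d)
score, pred = concept_stability_score(h, concept_vectors, params)
print(f"predicted class {pred}, stability score {score:.4f}")
```

Under the paper's hypothesis, thresholding such a score would flag inputs whose logits deviate strongly as OOD. The threshold, like the other choices above, would have to be calibrated on in-distribution data. A nonlinear head is used here because with a purely linear head an additive shift changes the logits by the same amount for every input, so the score would carry no per-sample signal.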