AI summary
The authors studied how machine learning models with similar predictive performance can behave very differently under counterfactual explanation, the search for small input changes that flip a model's prediction. They focused on systems that pair pretrained encoders with linear classifier heads and found that changing only the classifier head can substantially alter how easily counterfactuals are found while leaving accuracy largely unchanged. This happens because the position of the decision boundary relative to the data determines whether such changes are feasible and how far representations must move. Their work shows that counterfactual behavior is a dimension of model behavior distinct from accuracy, which matters for model selection and for the reliability of counterfactual methods.
counterfactual explanations, pretrained encoder, classifier head, decision boundary, representation space, predictive performance, local data support, machine learning interpretability, multimodal systems
Authors
Ioanna Gemou, Matteo Gamba, Randall Balestriero, Ritambhara Singh
Abstract
Counterfactual explanations seek small, semantically meaningful changes to an input that alter a model's prediction, and are widely used to interpret and audit machine learning systems. In modern vision, language, and multimodal systems, pretrained encoders map inputs to representation spaces, and downstream classifier heads impose decision boundaries within those spaces. As a result, the feasibility and distance of nearby counterfactuals depend on boundary placement relative to the data. Yet models with similar predictive performance can differ substantially in whether such changes are achievable and how far representations must move. This work examines this variation using a standardized local search probe across several pretrained encoders and linear classifier heads. Results show that despite similar predictive performance, models differ substantially in their counterfactual behavior. Under fixed representations, varying only the classifier head alters counterfactual outcomes while leaving predictive performance largely unchanged. This variation is explained by the interaction of decision-boundary proximity and local data support, which jointly determine whether prediction changes are feasible and lie in regions supported by the data; accounting for these two factors can also improve counterfactual search within fixed models. Together, these findings identify counterfactual behavior as a distinct dimension beyond predictive performance and show that it can be altered without changing accuracy, with implications for model selection, robustness, and the reliability of counterfactual methods.
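The abstract's central claim, that two heads on the same fixed representations can match in accuracy yet differ in counterfactual distance, can be illustrated with a toy sketch. This is not the authors' local search probe: it assumes a binary linear head with score w·z + b, uses the closed-form projection onto the hyperplane as the nearest counterfactual, and uses mean k-nearest-neighbor distance as a stand-in for local data support. All names (`boundary_distance`, `closest_counterfactual`, `knn_support`, the toy data) are hypothetical.

```python
import numpy as np

def predict(z, w, b):
    """Binary prediction of a linear head: class 1 if w.z + b > 0, else class 0."""
    return int(np.dot(w, z) + b > 0)

def boundary_distance(z, w, b):
    """Euclidean distance from representation z to the hyperplane w.z + b = 0."""
    return abs(float(np.dot(w, z) + b)) / float(np.linalg.norm(w))

def closest_counterfactual(z, w, b, margin=1e-3):
    """Project z onto the hyperplane, then step `margin` past it to flip the prediction.
    Assumes z is not exactly on the boundary (np.sign would return 0 there)."""
    w_norm = np.linalg.norm(w)
    signed = (np.dot(w, z) + b) / w_norm      # signed distance to the boundary
    direction = w / w_norm                    # unit normal of the hyperplane
    return z - (signed + np.sign(signed) * margin) * direction

def knn_support(z, reference, k=3):
    """Assumed proxy for local data support: mean distance from z to its
    k nearest reference representations (lower means better supported)."""
    d = np.linalg.norm(reference - z, axis=1)
    return float(np.sort(d)[:k].mean())

# Toy fixed representations (hypothetical data): two well-separated classes.
reference = np.array([[-2.0, 0.0], [-3.0, 0.0], [-2.0, 1.0],
                      [2.0, 0.0], [3.0, 0.0], [2.0, 1.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Two linear heads on the same representations: same direction, shifted boundary.
w_a, b_a = np.array([1.0, 0.0]), 0.0    # boundary at x = 0
w_b, b_b = np.array([1.0, 0.0]), -1.0   # boundary at x = 1

acc_a = float(np.mean([predict(r, w_a, b_a) == y for r, y in zip(reference, labels)]))
acc_b = float(np.mean([predict(r, w_b, b_b) == y for r, y in zip(reference, labels)]))

z = reference[3]                         # a class-1 point at (2, 0)
dist_a = boundary_distance(z, w_a, b_a)
dist_b = boundary_distance(z, w_b, b_b)
cf_a = closest_counterfactual(z, w_a, b_a)
cf_b = closest_counterfactual(z, w_b, b_b)
support_a = knn_support(cf_a, reference)
support_b = knn_support(cf_b, reference)

print("accuracy:", acc_a, acc_b)
print("distance to boundary:", dist_a, dist_b)
print("kNN support of counterfactual:", support_a, support_b)
```

In this toy setup both heads classify every reference point correctly, yet the representation must move twice as far to cross head A's boundary as head B's. The k-NN score is only one possible notion of data support; a density estimate or a distance restricted to target-class points would be equally reasonable assumptions.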