Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings
2026-04-24 • Machine Learning
Machine Learning · Artificial Intelligence · Human-Computer Interaction
AI summary
The authors studied different ways of explaining AI decisions with Shapley values, which indicate why a model made a particular prediction. They tested eight variants of these explanations on risk datasets and in a fraud-detection setting with professional analysts. Their findings show that the standard quantitative measures of explanation quality do not match how helpful or clear humans actually find the explanations. Moreover, explanations made analysts more confident without improving their performance, which could lead to over-reliance on AI. The authors argue that better evaluation methods are needed to understand how explanations really affect human decisions.
Shapley values · Explainable AI · Risk assessment · Fraud detection · Human-AI interaction · Automation bias · Explanation evaluation · Quantitative metrics · Faithfulness · Sparsity
Authors
Inês Oliveira e Silva, Sérgio Jesus, Iker Perez, Rita P. Ribeiro, Carlos Soares, Hugo Ferreira, Pedro Bizarro
Abstract
Shapley values are a cornerstone of explainable AI, yet their proliferation into competing formulations has created a fragmented landscape with little consensus on practical deployment. While theoretical differences are well-documented, evaluation remains reliant on quantitative proxies whose alignment with human utility is unverified. In this work, we use a unified amortized framework to isolate semantic differences between eight Shapley variants under the low-latency constraints of operational risk workflows. We conduct a large-scale empirical evaluation across four risk datasets and a realistic fraud-detection environment involving professional analysts and 3,735 case reviews. Our results reveal a fundamental misalignment: standard quantitative metrics, such as sparsity and faithfulness, are decoupled from human-perceived clarity and decision utility. Furthermore, while no formulation improved objective analyst performance, explanations consistently increased decision confidence, signaling a critical risk of automation bias in high-stakes settings. These findings suggest that current evaluation proxies are insufficient for predicting downstream human impact, and we provide evidence-based guidance for selecting formulations and metrics in operational decision systems.
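Background note
For readers unfamiliar with the underlying attribution: every Shapley formulation assigns feature i a weighted average of its marginal contributions across feature coalitions S. The classical definition (standard background, not a result of this paper) is

\[
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,\bigl(|N|-|S|-1\bigr)!}{|N|!}\,\Bigl(v\bigl(S \cup \{i\}\bigr) - v(S)\Bigr)
\]

where N is the feature set and v is a value function mapping a coalition to a model score; Shapley variants in the literature typically differ in how v handles the features outside S (e.g., interventional versus conditional marginalization), which is the kind of semantic difference the unified framework above isolates.

To make the quantitative proxies named in the abstract concrete, the sketch below computes one common operationalization of each: sparsity as the fraction of attribution mass on the top-k features, and faithfulness as the correlation between attributions and single-feature deletion effects. This is a minimal illustration under assumed definitions; the function names, the `model` and `baseline` inputs, and the metric choices here are illustrative, not the paper's exact implementation.

```python
import numpy as np

def sparsity_top_k(attributions, k=3):
    """Fraction of total absolute attribution mass on the top-k features.
    Higher means the explanation concentrates on fewer features.
    (Illustrative proxy; the paper may define sparsity differently.)"""
    mass = np.abs(np.asarray(attributions, dtype=float))
    total = mass.sum()
    if total == 0.0:
        return 0.0
    return float(np.sort(mass)[::-1][:k].sum() / total)

def faithfulness_correlation(model, x, attributions, baseline):
    """Correlation between each feature's attribution and the score change
    when that feature alone is replaced by its baseline value. Values near
    1.0 suggest attributions track what the model relies on.
    (Deletion-style proxy; assumed, not the paper's exact metric.)"""
    x = np.asarray(x, dtype=float)
    deltas = np.empty(len(x))
    for i in range(len(x)):
        x_pert = x.copy()
        x_pert[i] = baseline[i]          # remove feature i
        deltas[i] = model(x) - model(x_pert)
    return float(np.corrcoef(attributions, deltas)[0, 1])

# Toy check with a linear model, where exact Shapley values are known:
# for f(x) = w @ x with a zero baseline, phi_i = w_i * x_i.
w = np.array([0.5, -2.0, 0.1, 1.5])
model = lambda z: float(w @ z)
x = np.ones(4)
phi = w * x
print(sparsity_top_k(phi, k=2))                                   # ~0.85
print(faithfulness_correlation(model, x, phi, baseline=np.zeros(4)))  # 1.0
```

The toy example also illustrates the abstract's central point: both proxies can score highly while saying nothing about whether an analyst reading the explanation decides better, which is precisely the gap the human study probes.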