Fairness of Explanations in Artificial Intelligence (AI): A Unifying Framework, Axioms, and Future Direction toward Responsible AI

2026-05-11

Artificial Intelligence · Computational Engineering, Finance, and Science · Computers and Society · Machine Learning
AI summary

The authors discuss how machine learning models can appear fair based on their outcomes but still be unfair in how they make decisions, which they call procedural bias. They highlight the importance of studying fairness not just in the results but also in the explanations these models provide. Their work introduces a framework that defines explanation fairness by requiring explanations to be independent of protected attributes like race or gender when considering relevant factors. They also review existing research, classify different causes of unfair explanations, and propose steps for evaluating fairness in explanations practically.

machine learning, algorithmic fairness, explainable AI, procedural bias, protected attributes, conditional invariance, post-hoc explainers, explanation fairness, equity, evaluation workflow
Authors
Gideon Popoola, John Sheppard
Abstract
Machine learning algorithms are being used in high-stakes decisions, including those in criminal justice, healthcare, credit, and employment. The research community has responded with two largely independent research fields: \emph{algorithmic fairness}, which targets equitable outcomes, and \emph{explainable AI} (XAI), which targets interpretable reasoning. This survey identifies and maps a novel blind spot at their intersection: a model can satisfy every standard fairness criterion in its outputs while being profoundly unfair in its \emph{reasoning process}. We refer to this as procedural bias, and mitigating it requires treating the fairness of explanations as a distinct object of scientific study. To our knowledge, we provide the first unified theoretical and literature review of this emerging field and elucidate the drawbacks of post-hoc explainers in certifying explanation fairness. Our central contribution is a \emph{conditional invariance framework} formalizing explanation fairness as the requirement that, conditional on task-relevant features, explanations be invariant to protected attributes: $P(E(X) \in \cdot \mid X_\text{rel} = x_\text{rel},\, A = a) = P(E(X) \in \cdot \mid X_\text{rel} = x_\text{rel},\, A = b)$ for all task-relevant values $x_\text{rel}$ and all attribute values $a, b$. This is a single principle from which all existing explanation fairness metrics emerge as partial operationalizations. We introduce a seven-dimensional taxonomy, identify three generative mechanisms of explanation inequity (representation-driven, explanation-model mismatch, actionability-driven), and propose a canonical six-step evaluation workflow for operationalizing explanation fairness audits in practice.
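To make the conditional invariance requirement concrete, the sketch below shows one possible way such a check could be operationalized in practice. It is not the paper's evaluation workflow: it assumes explanations are per-feature attribution vectors, approximates conditioning on task-relevant features $X_\text{rel}$ by coarse quantile binning, and uses a per-dimension two-sample Kolmogorov–Smirnov test as a stand-in for comparing the full conditional explanation distributions across two protected groups. The function name `conditional_invariance_audit` and the toy gradient-style attributions are illustrative assumptions.

```python
# Minimal sketch of a conditional-invariance check for explanation fairness.
# Assumptions (not from the paper): attribution-vector explanations, quantile
# binning to approximate conditioning on X_rel, and per-feature KS tests as a
# proxy for comparing conditional explanation distributions.
import numpy as np
from scipy.stats import ks_2samp

def conditional_invariance_audit(explanations, X_rel, A, group_a, group_b, n_bins=5):
    """Compare explanation distributions between two protected groups,
    conditioning (approximately) on task-relevant features via binning."""
    # Discretize each relevant feature into quantile bins so we can compare
    # explanations within matching strata of X_rel.
    strata = np.stack(
        [np.digitize(X_rel[:, j],
                     np.quantile(X_rel[:, j], np.linspace(0, 1, n_bins + 1))[1:-1])
         for j in range(X_rel.shape[1])],
        axis=1,
    )
    _, stratum_ids = np.unique(strata, axis=0, return_inverse=True)

    pvals = []
    for s in np.unique(stratum_ids):
        in_s = stratum_ids == s
        ea = explanations[in_s & (A == group_a)]
        eb = explanations[in_s & (A == group_b)]
        if len(ea) < 5 or len(eb) < 5:
            continue  # too few samples in this stratum to compare reliably
        # Per-feature two-sample test between the groups' attribution values.
        for j in range(explanations.shape[1]):
            pvals.append(ks_2samp(ea[:, j], eb[:, j]).pvalue)
    return np.array(pvals)

# Usage with synthetic data and a toy linear "explainer" (raw gradients).
rng = np.random.default_rng(0)
n, d = 2000, 4
X = rng.normal(size=(n, d))
A = rng.integers(0, 2, size=n)          # protected attribute
w = np.array([1.0, -0.5, 0.25, 0.0])
explanations = X * w                    # toy local attribution vectors
pvals = conditional_invariance_audit(explanations, X[:, :2], A, 0, 1)
print(f"fraction of stratum/feature comparisons with p < 0.05: {(pvals < 0.05).mean():.3f}")
```

In this toy setup the explanations depend on the protected attribute only through the relevant features, so few comparisons should reject invariance; systematic rejections in real audits would indicate exactly the kind of procedural bias the abstract describes. Any practical instantiation would replace the binning and the KS statistic with whatever conditioning strategy and distributional distance the chosen explanation fairness metric prescribes.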