The Unseen Hand: Manipulating Model Fairness and SHAP with Targeted Identity Re-Association Attacks
2026-06-22 • Machine Learning
Machine Learning, Artificial Intelligence
AI summary
The authors show that methods used to check if machine learning models are fair and explainable can be tricked. They created a new family of attacks called TIRA (Targeted Identity Re-Association) that gently change a model's outputs without needing to look inside the model. These attacks make the model seem fairer than it really is and hide the role of sensitive features when explanations are generated. Compared with older attacks, TIRA leaves fewer detectable traces. Overall, the authors reveal a weakness in current fairness checks and explanation tools.
algorithmic fairness, model explainability, machine learning auditing, SHAP explanations, adversarial attacks, model output manipulation, protected features, TIRA attacks, Probabilistic Micro-Shuffling, Probabilistic Rank-Shift Micro-Perturbation
Authors
Sannaan Khan, Muhammad U. S. Khan
Abstract
As machine learning models grow more influential and opaque, algorithmic fairness and explainability are critical for ensuring accountability. However, we demonstrate that these auditing mechanisms are themselves vulnerable to subtle manipulation that camouflages the influence of protected features. While prior data-agnostic attacks have exposed this vulnerability, they leave behind detectable artifacts that compromise their stealth. We introduce Targeted Identity Re-Association (TIRA) attacks, a novel family of attacks that iteratively and probabilistically manipulate a model's outputs without requiring access to the model's internals or feature representations. We formalize two algorithms: Probabilistic Micro-Shuffling (PMiS), which applies localized adjacent swaps, and Probabilistic Rank-Shift Micro-Perturbation (PRSMP), which introduces small, randomized rank shifts. We empirically demonstrate that TIRA attacks are highly effective at pushing fairness metrics towards ideal values. Crucially, TIRA attacks successfully confound SHAP-based explanations, leaving effectively zero residual attribution for protected features, a major improvement over prior work.
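The abstract gives no pseudocode, so the following is only an illustrative sketch of what "localized adjacent swaps" on model outputs might look like, not the authors' PMiS algorithm. Everything in it is an assumption made for illustration: the function names `parity_gap` and `micro_shuffle`, the parameters `swap_prob`, `max_passes`, and `tol`, the convention that group 0 is the advantaged group, the choice of demographic parity as the fairness metric, and the stopping rule. It touches only scores and group labels, mirroring the claim that no access to model internals or feature representations is needed. PRSMP is not sketched here.

```python
import random


def parity_gap(scores, groups, threshold=0.5):
    """Demographic parity gap: |P(pred=1 | g=0) - P(pred=1 | g=1)|."""
    rates = []
    for g in (0, 1):
        member_scores = [s for s, gg in zip(scores, groups) if gg == g]
        positives = sum(s >= threshold for s in member_scores)
        rates.append(positives / len(member_scores))
    return abs(rates[0] - rates[1])


def micro_shuffle(scores, groups, swap_prob=0.3, max_passes=50, tol=0.05, seed=0):
    """Hypothetical adjacent-swap attack on model outputs (assumed design).

    Each pass ranks individuals by score and, with probability swap_prob,
    exchanges the scores of two rank-adjacent individuals when the
    higher-ranked one is in group 0 and the lower-ranked one is in group 1.
    Passes stop once the parity gap is at most tol (assumed stopping rule).
    """
    rng = random.Random(seed)
    scores = list(scores)  # work on a copy; the input is left untouched
    for _ in range(max_passes):
        if parity_gap(scores, groups) <= tol:
            break
        # Indices ordered from highest to lowest score at the start of the pass.
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        for upper, lower in zip(order, order[1:]):
            crosses_groups = groups[upper] == 0 and groups[lower] == 1
            if crosses_groups and rng.random() < swap_prob:
                scores[upper], scores[lower] = scores[lower], scores[upper]
    return scores
```

Because the sketch only permutes existing scores, the overall score distribution is unchanged; only the association between scores and individuals moves. Calling `micro_shuffle(scores, groups)` on a list of scores and a parallel list of 0/1 group labels returns the manipulated scores, and comparing `parity_gap` before and after shows how far the metric moved.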