Explainable AI in Speaker Recognition -- Attention Map Visualisation and Evaluation

2026-06-22

Artificial Intelligence
AI summary

The authors study how neural networks decide which parts of their input to focus on when identifying a speaker's voice. They look at existing ways to create 'attention maps' that highlight important input areas, but notice that evaluating these maps isn't well understood. They review an existing evaluation method, improve it, and call it Modified RISE-eval. Using this new method, they test two popular attention map techniques, GradCAM and LayerCAM, showing each works better in different situations for speaker recognition.

Explainable AI, Neural Networks, Attention Mechanism, Speaker Recognition, Class Activation Map (CAM), GradCAM, LayerCAM, RISE Evaluation, Modified RISE-eval
Authors
Yanze Xu, Mark D. Plumbley, Wenwu Wang
Abstract
Explaining and understanding the decision-making process of artificial intelligence (AI) systems, particularly those implemented by neural networks, falls within the field of explainable AI (XAI). Analogous to the human attention mechanism, neural networks are assumed to possess their own attention mechanisms that selectively process information during decision-making. This work studies one XAI topic: analysing and visualising the attention mechanisms of neural networks. Our experiments are performed on speaker recognition neural networks, which are trained to identify the speaker of a given utterance. Previous studies have widely used class activation map (CAM)-based methods to analyse and visualise the attention mechanisms of neural networks. Each of these methods produces an attention map for each network input, highlighting which input regions are selectively processed when the speaker recognition network makes decisions. However, the evaluation of the attention maps produced by these methods remains largely underexplored. This work systematically reviews an existing attention map evaluation algorithm, establishing key concepts and identifying its shortcomings. Building on this existing algorithm, we propose a new version that addresses the identified shortcomings, called the Modified Randomised Input Sampling for Explanation - Evaluation algorithm (Modified RISE-eval). Using Modified RISE-eval, we evaluate the attention maps produced by two representative CAM-based methods, GradCAM and LayerCAM, applied to a speaker recognition network. The evaluation results demonstrate that GradCAM and LayerCAM each exhibit distinct advantages under different experimental conditions in the speaker recognition task.
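
To make the difference between the two CAM-based methods concrete, the following is a minimal NumPy sketch of how GradCAM and LayerCAM are commonly formulated, not the authors' implementation. It assumes the activations of one convolutional layer and the gradients of the target speaker score with respect to those activations have already been extracted; the function names, the array layout (channels, frequency, time) and the toy values are illustrative assumptions.

```python
import numpy as np


def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """GradCAM-style map: one weight per channel, shared by all positions.

    Both inputs are assumed to have shape (channels, freq, time).
    """
    # Channel weight = gradient averaged over all freq/time positions.
    weights = gradients.mean(axis=(1, 2))                # (channels,)
    # Weighted sum of the activation maps over channels.
    cam = np.tensordot(weights, activations, axes=1)     # (freq, time)
    # Keep only positive evidence for the target class.
    return np.maximum(cam, 0.0)


def layer_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """LayerCAM-style map: one weight per channel AND per position."""
    # Positive gradients act as element-wise weights on the activations.
    weighted = np.maximum(gradients, 0.0) * activations  # (channels, freq, time)
    cam = weighted.sum(axis=0)                           # (freq, time)
    return np.maximum(cam, 0.0)


# Toy example (assumed values): 2 channels, 1 frequency bin, 2 time frames.
acts = np.array([[[1.0, 2.0]], [[3.0, 4.0]]])
grads = np.array([[[1.0, -1.0]], [[0.5, 0.5]]])

gradcam_map = grad_cam(acts, grads)
layercam_map = layer_cam(acts, grads)
```

In this toy case the first channel's gradients average to zero, so the GradCAM-style map ignores that channel entirely, while the LayerCAM-style map still uses its positive-gradient position. This is one way the two methods can highlight different input regions for the same network and utterance.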