A Multimodal Framework for Dementia Detection via Linguistic and Acoustic Representation Learning
2026-05-25 • Sound
Sound • Machine Learning
AI summary
The authors developed a computer program that can help detect Alzheimer's disease by listening to a person's speech and reading the transcript of what they say. They used advanced models to understand both the sounds and the words, and then combined these understandings in a way that makes the two work better together. Their method was tested on public datasets and showed good results in identifying signs of dementia from speech. This approach helps in making early detection of Alzheimer's more accurate using spoken language.
Alzheimer's disease, dementia detection, multimodal learning, speech processing, HuBERT model, BERT model, mutual information, attention mechanism, acoustic features, natural language processing
Authors
Loukas Ilias, Dimitris Askounis
Abstract
Alzheimer's disease (AD) is a progressive neurodegenerative disorder and the leading cause of dementia, affecting memory, reasoning, communication, and daily functioning. Early diagnosis is particularly important, as timely intervention may help slow cognitive decline and improve patient care. Recent studies have demonstrated that spontaneous speech contains valuable linguistic and acoustic biomarkers associated with dementia. However, existing approaches often rely on independently trained modality-specific models, feature concatenation strategies, ensemble methods, or attention-based fusion mechanisms that do not explicitly maximize the dependency between speech and transcript representations. In this work, we propose a multimodal deep learning framework for automatic dementia detection that jointly exploits speech and transcript information in an end-to-end trainable manner. Specifically, speech recordings are divided into 10-second segments and passed through a pre-trained HuBERT model to extract contextualized acoustic representations. To better capture informative temporal speech characteristics, attentive statistics pooling is employed to aggregate frame-level acoustic embeddings. For the textual modality, transcripts are encoded using a pre-trained BERT model, where the [CLS] token representation is used as the linguistic embedding. The acoustic and textual representations are subsequently combined using an attention-based Audio-Text Fusion (AT-Fusion) mechanism. In addition, we introduce a Mutual Information Neural Estimation (MINE) objective to maximize the mutual information between modalities and improve multimodal representation alignment. The fused multimodal representation is finally used for dementia classification. Experiments conducted on the publicly available ADReSS Challenge and PROCESS-2 datasets demonstrate the effectiveness and robustness of the proposed approach for speech-based dementia assessment.
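The abstract names three numerical steps that can be illustrated without any deep learning library: cutting a recording into 10-second segments, attentive statistics pooling over frame-level embeddings, and the mutual information lower bound that MINE maximizes. The following is a minimal pure-Python sketch under stated assumptions, not the authors' implementation. The function names, the 16 kHz sample rate, and the choice to keep a shorter final segment are all hypothetical. In a real system the attention scores would likely come from a small learned network over HuBERT frames, and the critic scores from a learned statistics network applied to matched (joint) and shuffled (marginal) audio-text pairs; here both are simply passed in as numbers.

```python
import math


def split_into_segments(num_samples, sample_rate=16000, segment_seconds=10.0):
    """Return (start, end) sample indices of consecutive fixed-length segments.

    Assumption: 16 kHz audio, and the last (shorter) segment is kept as-is.
    """
    seg_len = int(sample_rate * segment_seconds)
    return [(start, min(start + seg_len, num_samples))
            for start in range(0, num_samples, seg_len)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats_pooling(frames, scores):
    """Aggregate T frame vectors (each of dim D) into one vector of length 2*D.

    `scores` are T unnormalized attention scores (hypothetical inputs here).
    The output concatenates the attention-weighted mean and the
    attention-weighted standard deviation of the frames.
    """
    weights = softmax(scores)
    dim = len(frames[0])
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    second_moment = [sum(w * f[d] ** 2 for w, f in zip(weights, frames))
                     for d in range(dim)]
    # Clamp the variance to a small positive value before the square root.
    std = [math.sqrt(max(second_moment[d] - mean[d] ** 2, 1e-12))
           for d in range(dim)]
    return mean + std


def mine_lower_bound(joint_scores, marginal_scores):
    """Donsker-Varadhan bound: mean(T on joint) - log(mean(exp(T on marginal))).

    `joint_scores` are critic outputs on matched audio-text pairs;
    `marginal_scores` are critic outputs on mismatched (shuffled) pairs.
    """
    joint_term = sum(joint_scores) / len(joint_scores)
    top = max(marginal_scores)
    log_mean_exp = top + math.log(
        sum(math.exp(s - top) for s in marginal_scores) / len(marginal_scores)
    )
    return joint_term - log_mean_exp
```

When training, the negative of `mine_lower_bound` would typically be added to the classification loss, so that minimizing the total loss also pushes the mutual information estimate upward. How the paper weights the two terms is not stated in the abstract.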