Decoding while Adapting: Zero-Shot Online Speaker Adaptation via Audio-Textual Prompts for Elderly Speech Recognition

2026-06-15 • Sound


AI summary

The authors developed a new way to help speech recognition systems better understand elderly speakers by using both audio and text information from the current and a few preceding utterances. Their method adapts to new speakers on the fly, without needing prior training on them. Tests on English and Cantonese elderly speech showed that this approach makes fewer recognition errors than a model that does not adapt to individual speakers. Their system also runs up to 9.83 times faster than offline batch-mode adaptation.

speaker adaptation, elderly speech recognition, audio-textual prompts, cross-modal fusion, embedding, word error rate (WER), character error rate (CER), real-time processing, DementiaBank Pitt dataset, ECAPA-TDNN

Authors

Chengxi Deng, Xurong Xie, Shujie Hu, Mengzhe Geng, Tianzi Wang, Youjun Chen, Huimeng Wang, Haoning Xu, Jiajun Deng, Xunying Liu

Abstract

This paper proposes a novel cross-utterance, audio-textual prompt-based speaker adaptation approach for elderly speech recognition. It enables zero-shot, real-time adaptation to unseen speakers. Speech and text embeddings are extracted from the current and a few preceding utterances, then fused in a cross-modal manner to produce compact speaker prompts that are more consistent than i/x-vectors and ECAPA-TDNN features. Experiments on the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets suggest that the proposed online adaptation outperforms the speaker-independent (SI) model, with statistically significant word error rate (WER) and character error rate (CER) reductions of 0.61% and 1.22% absolute (2.99% and 4.48% relative), respectively. Real-time factor (RTF) speed-up ratios of up to 9.83 times are obtained over offline batch-mode adaptation.
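
The absolute and relative reductions quoted in the abstract are linked by the baseline error rate: relative reduction = absolute reduction / baseline. The short Python sketch below works backwards from the reported figures to the baseline rates they imply. The function names are illustrative and not taken from the paper, and because the reported percentages are rounded, the results are approximate (roughly 20.4% WER and 27.2% CER for the SI baselines). The speed-up helper assumes the usual definition of an RTF speed-up ratio as offline RTF divided by online RTF; the abstract does not give the underlying RTF values.

```python
def implied_baseline(absolute_reduction: float, relative_reduction: float) -> float:
    """Baseline error rate (in %) implied by an absolute reduction (in % points)
    and the matching relative reduction (in %)."""
    return absolute_reduction / (relative_reduction / 100.0)


def adapted_rate(baseline: float, absolute_reduction: float) -> float:
    """Error rate (in %) after subtracting the absolute reduction from the baseline."""
    return baseline - absolute_reduction


def rtf_speed_up(rtf_offline: float, rtf_online: float) -> float:
    """Speed-up ratio, assuming it is defined as offline RTF over online RTF."""
    return rtf_offline / rtf_online


if __name__ == "__main__":
    # Figures reported in the abstract: (absolute %, relative %)
    reported = {
        "DementiaBank Pitt (WER)": (0.61, 2.99),
        "JCCOCC MoCA (CER)": (1.22, 4.48),
    }
    for name, (abs_red, rel_red) in reported.items():
        base = implied_baseline(abs_red, rel_red)
        print(f"{name}: implied SI baseline ~{base:.1f}%, "
              f"adapted ~{adapted_rate(base, abs_red):.1f}%")
```

These back-calculated baselines are only a consistency check on the abstract's numbers; the exact error rates should be taken from the paper's result tables.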
