Decoding while Adapting: Zero-Shot Online Speaker Adaptation via Audio-Textual Prompts for Elderly Speech Recognition
2026-06-15 • Sound
AI summary
The authors developed a way to help speech recognition systems better understand elderly speakers by using both audio and text from the current and recent utterances. The method adapts to new speakers on the fly, without any prior training on them. Tests on English and Cantonese elderly speech showed that this approach reduces errors relative to a model that does not adapt to individual speakers. The system also runs up to 9.83 times faster, measured by real-time factor, than offline batch-mode adaptation.
speaker adaptation, elderly speech recognition, audio-textual prompts, cross-modal fusion, embedding, word error rate (WER), character error rate (CER), real-time processing, DementiaBank Pitt dataset, ECAPA-TDNN
Authors
Chengxi Deng, Xurong Xie, Shujie Hu, Mengzhe Geng, Tianzi Wang, Youjun Chen, Huimeng Wang, Haoning Xu, Jiajun Deng, Xunying Liu
Abstract
This paper proposes a novel speaker adaptation approach for elderly speech recognition based on cross-utterance audio-textual prompts. It enables zero-shot, real-time adaptation to unseen speakers. Speech and text embeddings are extracted from the current utterance and a few preceding utterances, then fused in a cross-modal manner to produce compact speaker prompts that are more consistent than i/x-vectors and ECAPA-TDNN features. Experiments on the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets suggest that the proposed online adaptation outperforms the speaker-independent (SI) model, with statistically significant word error rate (WER) or character error rate (CER) reductions of 0.61% and 1.22% absolute (2.99% and 4.48% relative). Real-time factor (RTF) speed-up ratios of up to 9.83 times are obtained over offline batch-mode adaptation.
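The absolute and relative reductions quoted in the abstract are linked by simple arithmetic, which can be checked with a short sketch. It assumes the usual definitions: relative reduction = absolute reduction / baseline error rate, RTF = processing time / audio duration, and speed-up = offline RTF / online RTF. The function names are illustrative, and the baseline error rates it prints are implied by the quoted figures rather than stated in the abstract.

```python
def implied_baseline(abs_reduction_pct: float, rel_reduction_pct: float) -> float:
    """Baseline error rate (in %) implied by an absolute and a relative
    reduction, both given in %. Assumes relative = absolute / baseline."""
    return abs_reduction_pct / (rel_reduction_pct / 100.0)


def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time divided by audio duration."""
    return processing_seconds / audio_seconds


def speedup(rtf_offline: float, rtf_online: float) -> float:
    """Speed-up ratio of the online system over the offline one."""
    return rtf_offline / rtf_online


# The two (absolute, relative) pairs quoted in the abstract, in order.
for abs_red, rel_red in [(0.61, 2.99), (1.22, 4.48)]:
    baseline = implied_baseline(abs_red, rel_red)
    adapted = baseline - abs_red
    print(f"implied SI baseline ~ {baseline:.2f}%, adapted ~ {adapted:.2f}%")
```

Under these assumptions, the first pair implies an SI baseline of roughly 20.4% and the second roughly 27.2%. Both values are rounded back-calculations, not results reported in the abstract.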