Advancing Electrolaryngeal Speech Enhancement Through Speech-Text Representation Learning
2026-06-01 • Sound
AI summary
The authors worked on improving speech produced by electrolarynx devices, which often sounds unnatural and hard to understand. They designed a method that combines both speech and text information to better convert this electrolarynx speech into more normal-sounding speech using a special machine learning model. Their approach uses three ways to mix speech and text data and adds extra training steps to improve accuracy. Testing showed their method performs better than previous ones that only used speech data. This research could help make communication easier for people who rely on electrolarynx devices.
Electrolaryngeal (EL) speech, Sequence-to-sequence (seq2seq) models, Voice conversion (VC), Representation learning, Speech-to-text integration, Autoencoder, Reconstruction loss, Data augmentation, Assistive communication devices
Authors
Ding Ma, Jinyi Mi, Fengji Li, Lester Phillip Violeta, Jiajun He, Wenchin Huang, Kazuhiro Kobayashi, Tomoki Toda
Abstract
Objective: Laryngectomees depend on an electromechanical device to generate electrolaryngeal (EL) speech. Compared with normal speech, EL speech suffers from severe distortion, limited phonetic variation, unnatural prosody, and temporal shifts, all of which degrade naturalness and intelligibility. Although EL-speech-to-normal-speech conversion (EL2SP) based on sequence-to-sequence (seq2seq) voice conversion (VC) is promising, substantial mismatches between EL and normal speech inevitably cause cumulative mapping errors that limit performance. To address this, we describe a novel representation learning framework that integrates speech and text representations to improve mapping and reconstruction quality within a seq2seq VC model.
Methods: Our methodology comprises two main stages: 1) representation integration and learning, and 2) reconstruction training. First, a network capable of incorporating auxiliary text information is constructed from pretrained modules to learn integrated speech-text representations. Then, an autoencoder-style reconstruction strategy finalizes the EL2SP model so that it inherits these representations without increasing model complexity. We introduce three fusion strategies (middle-level, input-level, and hybrid-level fusion) that progressively enhance learning. Moreover, in addition to the standard seq2seq VC objectives, a reconstruction loss on the integrated representation is introduced to refine representation transfer.
Results: Experiments on different EL2SP datasets consistently demonstrate that our methods, combined with data augmentation, outperform baselines relying solely on speech representations. Furthermore, progressive improvements with increasing system design depth validate the effectiveness of our methods.
Significance: The proposed methods provide an extensible and practical methodology for EL speech enhancement and assistive communication technologies.
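As a rough illustration of the training objective described in the abstract, the sketch below combines a standard seq2seq VC loss with an additional reconstruction term on the integrated representation. It is not the authors' implementation. The use of an L1 distance for both terms, the weighting factor `rec_weight`, and every function and variable name are assumptions made for this example; the abstract does not specify any of them.

```python
# Hypothetical sketch: total loss = VC loss + weighted reconstruction loss.
# The L1 distance and the rec_weight parameter are assumptions, not details
# taken from the paper.

def l1_loss(pred, target):
    """Mean absolute error between two equally shaped frames-by-dims lists."""
    diffs = [
        abs(p - t)
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    ]
    return sum(diffs) / len(diffs)


def total_loss(pred_feats, target_feats, model_repr, integrated_repr, rec_weight=1.0):
    """Combine the conversion loss with a representation reconstruction loss.

    pred_feats / target_feats: converted and reference normal-speech features.
    model_repr: representation produced by the speech-only EL2SP model.
    integrated_repr: speech-text representation learned in the first stage,
        treated here as a fixed target.
    """
    vc_loss = l1_loss(pred_feats, target_feats)
    rec_loss = l1_loss(model_repr, integrated_repr)
    return vc_loss + rec_weight * rec_loss


# Toy example with 2 frames; the dimensions are arbitrary.
pred_feats = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
target_feats = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
model_repr = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
integrated_repr = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]

loss = total_loss(pred_feats, target_feats, model_repr, integrated_repr, rec_weight=2.0)
print(loss)
```

Under this reading, the text information is needed only to produce `integrated_repr` during training, which is consistent with the abstract's statement that the final EL2SP model inherits the representations without added model complexity.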