Synthesizing the Lombard Effect: Multi-Level Control of Speech Clarity and Vocal Effort in TTS
2026-06-22 • Sound
Sound, Computation and Language
AI summary
The authors studied the Lombard effect, the tendency of people to speak louder and more clearly in noisy places. They created a text-to-speech system that mimics this by controlling vocal effort and articulation. The system can also adjust these properties for specific words to make them easier to understand. Tests show that the model makes speech clearer and helps listeners understand spoken words better in noisy conditions.
Lombard effect, text-to-speech (TTS), flow matching, vocal effort, articulation, word-level emphasis, speech clarity, speech intelligibility, speech synthesis, clear speech
Authors
Seymanur Akti, Alexander Waibel
Abstract
Humans tend to speak louder and more clearly in challenging environments, such as noisy conditions or when addressing hearing-impaired listeners; this is called the Lombard effect. To simulate this behavior in speech synthesis systems, we introduce a flow-matching-based text-to-speech (TTS) model trained with vocal effort and articulation pseudo-labels. The proposed model achieves continuous and disentangled control of vocal effort and articulation, while also enabling word-level emphasis for clarifying specific segments of an utterance. Experimental results show that these control mechanisms effectively improve clarity-related acoustic features. Furthermore, speech-in-noise experiments demonstrate that our model successfully simulates the intelligibility gains of human clear speech in noisy conditions.
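To make the idea of word-level control in a flow-matching model more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a linear probability path between Gaussian noise and the target features (a common choice in flow matching; the abstract does not state which path is used), control values on a 0 to 1 scale, and one scalar acoustic feature per frame. The helper names `expand_word_controls` and `flow_matching_example` are hypothetical.

```python
import random

def expand_word_controls(word_values, frames_per_word):
    # Hypothetical helper: repeat each word-level control value for every
    # acoustic frame that belongs to that word.
    frame_values = []
    for value, n_frames in zip(word_values, frames_per_word):
        frame_values.extend([value] * n_frames)
    return frame_values

def flow_matching_example(target, t, rng):
    # Assumed linear path: x_t = (1 - t) * x0 + t * x1, with x0 ~ N(0, 1).
    # The regression target for the velocity network is x1 - x0.
    noise = [rng.gauss(0.0, 1.0) for _ in target]
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, target)]
    velocity = [x - n for n, x in zip(noise, target)]
    return x_t, velocity

# Three words spanning 2, 3, and 1 frames; the second word is emphasized.
# The 0-to-1 control scale is an assumption made for illustration.
frames_per_word = [2, 3, 1]
effort = expand_word_controls([0.2, 0.9, 0.2], frames_per_word)
articulation = expand_word_controls([0.5, 1.0, 0.5], frames_per_word)

rng = random.Random(0)
t = 0.3
# Stand-in for one acoustic feature value per frame.
target = [0.1 * i for i in range(len(effort))]
x_t, velocity = flow_matching_example(target, t, rng)

# Per-frame input a velocity network would receive:
# (noisy feature, time step, vocal effort, articulation).
model_inputs = [(x, t, e, a) for x, e, a in zip(x_t, effort, articulation)]
```

In a setup like this, raising the effort or articulation value for only the frames of one word is what would give word-level emphasis, while setting the same value on every frame would give utterance-level control.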