Synthesizing the Lombard Effect: Multi-Level Control of Speech Clarity and Vocal Effort in TTS

2026-06-22 • Sound

Sound; Computation and Language

AI summary

The authors study the Lombard effect: the natural tendency of people to speak louder and more clearly in noisy places. They built a text-to-speech system that mimics this behavior by controlling how much vocal effort and articulatory clarity the voice has. The system can also adjust individual words to make them easier to understand. Tests show that these controls improve acoustic measures of clarity and that the synthesized speech is easier to understand in noise.

Lombard effect, text-to-speech (TTS), flow matching, vocal effort, articulation, word-level emphasis, speech clarity, speech intelligibility, speech synthesis, clear speech

Authors

Seymanur Akti, Alexander Waibel

Abstract

Humans tend to speak louder and more clearly in challenging environments, such as noisy conditions or when addressing hearing-impaired listeners, a phenomenon known as the Lombard effect. To simulate this behavior in speech synthesis systems, we introduce a flow-matching-based text-to-speech (TTS) model trained with vocal effort and articulation pseudo-labels. The proposed model achieves continuous and disentangled control of vocal effort and articulation, while also enabling word-level emphasis for clarifying specific segments of an utterance. Experimental results show that these control mechanisms effectively improve clarity-related acoustic features. Furthermore, speech-in-noise experiments demonstrate that our model successfully simulates the intelligibility gains of human clear speech in noisy conditions.
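
The abstract does not describe how the speech-in-noise stimuli were constructed. As a minimal sketch of the general idea, assuming additive noise mixed with the speech at a fixed signal-to-noise ratio (an assumption for illustration, not the authors' protocol), the noise can be rescaled so that the speech-to-noise RMS ratio matches a target value in decibels. The function names `rms`, `mix_at_snr`, and `measured_snr_db` are hypothetical, and the signals are plain Python lists of samples.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a non-empty sequence of samples."""
    if len(samples) == 0:
        raise ValueError("samples must not be empty")
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio equals `snr_db`, then add it.

    Returns (mixture, scaled_noise). Illustrative only: the paper's actual
    noise type and SNR levels are not given in the abstract.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        raise ValueError("noise must not be silent")
    # SNR_dB = 20 * log10(rms_speech / rms_noise), solved for rms_noise.
    target_noise_rms = rms(speech) / (10.0 ** (snr_db / 20.0))
    gain = target_noise_rms / noise_rms
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


def measured_snr_db(speech, noise):
    """SNR in dB between a speech signal and the noise added to it."""
    return 20.0 * math.log10(rms(speech) / rms(noise))


if __name__ == "__main__":
    sample_rate = 16000  # assumed sample rate for the toy signals
    # Toy stand-in for a synthesized utterance: one second of a 220 Hz tone.
    speech = [0.5 * math.sin(2.0 * math.pi * 220.0 * t / sample_rate)
              for t in range(sample_rate)]
    rng = random.Random(0)
    noise = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]
    mixture, scaled = mix_at_snr(speech, noise, snr_db=-5.0)
    print(round(measured_snr_db(speech, scaled), 3))
```

Under this setup, a model with and without its clarity controls would be mixed with the same noise at the same SNR, so that any difference in listener or recognizer accuracy can be attributed to the speech rather than to the noise level.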
