Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis
2026-06-22 • Computation and Language
Computation and Language, Artificial Intelligence
AI summary
The authors created Bagpiper-TTS, a speech synthesis system that handles flexible, natural language requests instead of rigid inputs. It first interprets what the user wants and turns that into a detailed text guide that includes both the words to be spoken and extra details such as tone or style. It then uses this guide to produce the spoken output. Bagpiper-TTS covers many tasks, including multi-talker speech, role-play voices, and singing. In the reported experiments it reaches a 1.7% word error rate on the Seed-TTS-Eval benchmark and matches dedicated models in LLM-as-a-judge and human evaluations.
Text-to-Speech (TTS), Natural Language Processing, Speech Synthesis, Word Error Rate (WER), Multi-talker Synthesis, Intent Recognition, Role-play Synthesis, Singing Voice Synthesis, LLM-as-a-judge, Subjective Evaluation
Authors
Jinchuan Tian, Haoran Wang, Siddhant Arora, Takashi Maekaku, Keita Goto, Jin Sakuma, Yusuke Shinohara, Chao-Han Huck Yang, Shinji Watanabe
Abstract
Classical TTS systems typically rely on rigid input formats and predefined metadata slots, limiting their ability to fulfill flexible user requirements. This paper introduces Bagpiper-TTS, a universal speech synthesis system that handles diverse natural language user requests. Given a natural language prompt, Bagpiper-TTS first reasons over the user's intent to derive a rich caption, i.e., a comprehensive textual blueprint encompassing both transcription and nuanced metadata. Subsequently, this caption guides the synthesis of the target speech. Our model inherently supports a broad spectrum of tasks besides classical TTS applications, including multi-talker, intent-to-speech, role-play synthesis, singing voice synthesis, and more. Experimental results demonstrate that Bagpiper-TTS achieves a 1.7% Word Error Rate (WER) on the Seed-TTS-Eval benchmark and matches the performance of dedicated models in both LLM-as-a-judge and human subjective evaluations across multiple applications.
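For readers unfamiliar with the WER figure quoted above, the following is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance between a reference transcript and a hypothesis transcript, divided by the number of reference words. It is not the paper's evaluation code. The function name `word_error_rate`, the lowercase-and-split normalization, and the example sentences are illustrative assumptions, and a real benchmark run would first transcribe the synthesized audio with an ASR model.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: text is normalized only by lowercasing and whitespace
    splitting; real evaluation pipelines usually normalize more carefully.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts: one substitution out of six reference words.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
    print(f"{wer:.3f}")  # prints 0.167
```

Under this definition, a 1.7% WER corresponds to roughly 17 word errors per 1,000 reference words.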