Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis

2026-06-22 • Computation and Language

Computation and Language • Artificial Intelligence

AI summary

The authors created Bagpiper-TTS, a speech synthesis system that handles flexible natural language requests instead of rigid input formats. It first interprets what the user wants and turns that into a detailed text guide containing both the words to be spoken and extra details such as tone or style. It then uses this guide to produce the spoken output. Bagpiper-TTS covers many tasks, including multi-speaker synthesis, role-play voices, and singing. In the reported evaluations it reaches a 1.7% word error rate on Seed-TTS-Eval and matches dedicated systems across several applications.

Text-to-Speech (TTS) • Natural Language Processing • Speech Synthesis • Word Error Rate (WER) • Multi-talker Synthesis • Intent Recognition • Role-play Synthesis • Singing Voice Synthesis • LLM-as-a-judge • Subjective Evaluation

Authors

Jinchuan Tian, Haoran Wang, Siddhant Arora, Takashi Maekaku, Keita Goto, Jin Sakuma, Yusuke Shinohara, Chao-Han Huck Yang, Shinji Watanabe

Abstract

Classical TTS systems typically rely on rigid input formats and predefined metadata slots, limiting their ability to fulfill flexible user requirements. This paper introduces Bagpiper-TTS, a universal speech synthesis system that handles diverse natural language user requests. Given a natural language prompt, Bagpiper-TTS first reasons over the user's intent to derive a rich caption, i.e., a comprehensive textual blueprint encompassing both transcription and nuanced metadata. Subsequently, this caption guides the synthesis of the target speech. Our model inherently supports a broad spectrum of tasks besides classical TTS applications, including multi-talker, intent-to-speech, role-play synthesis, singing voice synthesis, and more. Experimental results demonstrate that Bagpiper-TTS achieves a 1.7% Word Error Rate (WER) on the Seed-TTS-Eval benchmark and matches the performance of dedicated models in both LLM-as-a-judge and human subjective evaluations across multiple applications.
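
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a recognized transcript of the synthesized audio, divided by the number of reference words. The sketch below illustrates that standard definition only. The function name `word_error_rate`, the lowercase and whitespace tokenization, and the example sentences are illustrative assumptions, not the scoring procedure used by the authors or by Seed-TTS-Eval.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: simple lowercase + whitespace tokenization; real benchmarks
    usually apply their own text normalization before scoring.
    """
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + substitution_cost,   # match or substitution
            )
        prev = curr

    return prev[-1] / len(ref_words)


# One substituted word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
```

Under this definition, a 1.7% WER corresponds to roughly 17 word errors per 1,000 reference words.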
