UniVocal: Unified Speech-Singing Code-Switching Synthesis

2026-06-01 • Sound

AI summary

The authors introduce UniVocal, a system that decides from the text alone when to switch between speaking and singing, without needing special tags. They train it with a two-stage curriculum learning strategy, and they build a pipeline for synthesizing code-switching training data along with a new test set called SCSBench. They also add a planning step, based on cent tokens and Chain-of-Thought generation, that sets the prosody before the content is generated, which improves expressive speech and singing melody. Their tests show that UniVocal achieves state-of-the-art results on SCSBench for mixed speech and singing while staying competitive on regular speaking and singing tasks.

Text-to-Speech (TTS), Speech-Singing Code-Switching, Curriculum Learning, Prosody, Semantic Tokenizer, Chain-of-Thought (CoT) Generation, Acoustic Modeling, Code-Switching, Benchmark Dataset

Authors

Yufei Shi, Qian Chen, Wen Wang, Xiangang Li, Zhen-Hua Ling, Yang Ai

Abstract

We propose UniVocal, a unified framework that pioneers Speech-Singing Code-Switching (SCS) synthesis, a task in which transitions between speaking and singing are driven autonomously by textual semantics, akin to seamless human language blending. Unlike single-mode generation or systems that rely on switching-control tags, UniVocal implicitly infers vocal modes solely from text context. To achieve this, we employ a data-efficient two-stage curriculum learning strategy that progressively trains a competitive TTS system to acquire the desired SCS capability. To address data scarcity, we introduce a scalable pipeline for synthesizing diverse code-switching data that is both semantically and acoustically natural, alongside a new multi-scenario benchmark, SCSBench. To address the limitations of semantic tokenizers in capturing acoustic details, we also introduce a refined cent token and Chain-of-Thought (CoT) generation for planning prosody before content generation, which enhances both empathetic speech generation and singing melody. Experimental results demonstrate that UniVocal achieves state-of-the-art performance on SCSBench while maintaining competitive performance on regular speech and singing tasks. Audio samples are available at https://project-univocal-demo.github.io/demo/. The code and dataset are released at https://github.com/FunAudioLLM/FunResearch/tree/main/UniVocal.
