CosyEdit2: Speech-Editing-Oriented Reinforcement Learning Unlocks Better Zero-Shot TTS

2026-05-25 • Sound


AI summary

The authors studied how to improve speech editing, which requires the newly generated speech to match the surrounding unedited recording closely. They found that previous methods based on supervised fine-tuning were limited by imperfect paired editing data and coarse-grained optimization signals. To address this, they created CosyEdit2, which is trained in two stages: it first learns basic editing through supervised fine-tuning, and is then refined with editing-oriented Group Relative Policy Optimization (GRPO) on data that contains no target speech. Their experiments show that CosyEdit2 improves both speech editing and zero-shot text-to-speech, suggesting that the two tasks are closely linked.

Speech editing, Text-to-Speech (TTS), Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), Zero-shot learning, Acoustic consistency, Post-training, Generative models

Authors

Junyang Chen, Yuhang Jia, Hui Wang, Jiaming Zhou, Yongchang Gan, Yong Qin

Abstract

Speech editing and zero-shot Text-to-Speech (TTS) share a similar generative foundation conditioned on speech prompts, yet speech editing demands far stricter local acoustic consistency with surrounding unedited content. While prior work has shown that Supervised Fine-Tuning (SFT) enables TTS models to acquire functional editing capability, this approach remains fundamentally bottlenecked by imperfect paired editing data and coarse-grained optimization signals. To address these limitations, we propose CosyEdit2, a speech editing model built on a two-stage post-training framework that progresses from supervised editing initialization to editing-oriented Group Relative Policy Optimization (GRPO) over target-speech-free data. Extensive experiments demonstrate that CosyEdit2 not only substantially advances speech editing performance, but also unlocks better zero-shot TTS capability, revealing a deeper mutual relationship between the two tasks. Audio samples are available at https://cjy1018.github.io/CosyEdit2.
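
As a rough illustration of the group-relative idea behind GRPO in general (not the authors' implementation), several candidates are sampled for the same input, each is scored with a reward, and each reward is normalized by the mean and standard deviation of its own group. The function name, the reward values, and the epsilon term below are assumptions made for this sketch; the abstract does not specify the reward CosyEdit2 uses.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group.

    Hypothetical helper: GRPO-style methods use the group's mean reward
    as a baseline, so no separate value network is required.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps (an assumption) avoids division by zero when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four candidate edits sampled from one input
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Candidates that score above the group mean receive a positive advantage and are reinforced, while those below it are discouraged. Because the baseline comes from sampled outputs rather than a reference recording, this style of objective is compatible with the target-speech-free data the abstract describes.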
