FlowTTS-GRPO: Online Reinforcement Learning with Multi-Objective Reward Optimization for Flow-Matching Based Text-to-Speech

2026-06-22 • Sound

AI summary

The authors introduce FlowTTS-GRPO, an online reinforcement learning framework for flow-matching text-to-speech models, which have received less RL attention than LLM-based systems. By converting ODE trajectories into SDE paths, the method fine-tunes open-source flow-matching models directly, without auxiliary models. A weighted combination of rewards converges faster than a probabilistic scheme. On CosyVoice 3.0 and F5-TTS, the method improves speaker similarity and perceptual quality, and F5-TTS also becomes more intelligible. The authors also report three practical findings: omitting classifier-free guidance during training speeds up convergence, synthesizing hard cases improves robustness, and applying RL to the flow-matching component improves audio-detail metrics.

Reinforcement Learning, Text-to-Speech (TTS), Flow-Matching, Ordinary Differential Equation (ODE), Stochastic Differential Equation (SDE), Classifier-Free Guidance (CFG), Speaker Similarity, Perceptual Quality, Intelligibility

Authors

Haoxu Wang, Biao Tian, Weiqing Li, Xiang Lv, Han Zhao, Xiangang Li

Abstract

Existing Reinforcement Learning (RL) research for Text-to-Speech (TTS) focuses on large language models (LLMs), leaving Flow-Matching (FM) under-explored. We present FlowTTS-GRPO, an online RL framework for FM-based TTS. By converting ordinary differential equation (ODE) trajectories into stochastic differential equation (SDE) paths, our method enables direct fine-tuning of open-source FM models without auxiliary models. We show that a weighted reward combination converges faster than a probabilistic scheme, and identify three practical optimizations: omitting classifier-free guidance (CFG) during training accelerates convergence; synthesizing hard cases improves robustness; and applying RL to the FM component enhances audio-detail metrics. Experiments on CosyVoice 3.0 and F5-TTS demonstrate objective and subjective preference gains in speaker similarity and perceptual quality, with F5-TTS also improving intelligibility.
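
The abstract contrasts a weighted reward combination with a probabilistic scheme but gives no formulas. The sketch below is one plausible reading, not the authors' implementation: the weighted variant takes a fixed weighted sum of per-objective scores, the probabilistic variant is assumed here to sample a single objective per update, and the advantage is the usual GRPO-style standardization within a group of samples generated from the same prompt. The objective names, weights, scores, and helper function names are all hypothetical, and every score is assumed to be pre-normalized so that higher is better.

```python
import random
import statistics


def weighted_reward(scores, weights):
    """Combine per-objective scores into one scalar via a fixed weighted sum."""
    return sum(weights[name] * scores[name] for name in weights)


def sampled_reward(scores, probs, rng):
    """Assumed 'probabilistic' variant: pick one objective at random
    (with probability proportional to probs[name]) and return its score."""
    names = list(probs)
    chosen = rng.choices(names, weights=[probs[n] for n in names], k=1)[0]
    return scores[chosen]


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical scores for four utterances synthesized from the same prompt.
group = [
    {"speaker_sim": 0.70, "quality": 0.60, "intelligibility": 0.90},
    {"speaker_sim": 0.80, "quality": 0.65, "intelligibility": 0.85},
    {"speaker_sim": 0.60, "quality": 0.70, "intelligibility": 0.95},
    {"speaker_sim": 0.75, "quality": 0.55, "intelligibility": 0.80},
]
# Hypothetical weights; the abstract does not state the actual values.
weights = {"speaker_sim": 0.4, "quality": 0.3, "intelligibility": 0.3}

rewards = [weighted_reward(s, weights) for s in group]
advantages = group_relative_advantages(rewards)

rng = random.Random(0)
sampled = [sampled_reward(s, weights, rng) for s in group]
```

Under this reading, the weighted sum gives every sample a signal from all objectives at once, whereas the sampled variant exposes only one objective per draw, which is one possible intuition for the faster convergence the abstract reports.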
