HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech
2026-06-26 • Computation and Language
Computation and Language, Sound
AI summary
The authors address the problem that AI-generated speech tends to sound flat and averaged rather than emotional. They found that representing content and emotion in one shared space causes the two to interfere with each other, and that sentence-level feedback is too coarse to guide frame-by-frame speech generation. To fix this, they created HPRO, a method that separates content from emotion and gives feedback at three levels: frames, words, and sentences. Their experiments show this helps the AI produce more expressive speech while keeping it intelligible.
Large Language Model, Text-to-Speech, Supervised Fine-Tuning, Prosody, Reward Optimization, Latent Space, Emotional Expressiveness, Hierarchical Optimization, Differentiable Reward Model, Speech Synthesis
Authors
Sihang Nie, Xiaofen Xing, Rui Xing, Haoming Li, Ruitong Xiao, Jingyuan Xing, Baiji Liu, Xiangmin Xu
Abstract
Recently, Large Language Model (LLM)-based Text-to-Speech (TTS) models have achieved remarkable naturalness. However, the standard Supervised Fine-Tuning paradigm often converges to statistically averaged prosody, limiting emotional expressiveness. While preference-driven optimization offers a promising alternative, existing approaches suffer from two structural mismatches: information conflict, where content and emotion in a shared latent space produce conflicting gradients, leading to reward hacking and semantic degradation; and scale gap, where sparse sentence-level rewards struggle to guide dense frame-level generation. To overcome these challenges, we propose HPRO, a hierarchical progressive reward optimization framework. Within HPRO, we introduce the HD-Emo codec as a novel differentiable reward model to resolve the information conflict. It extracts distinct content and style preference tokens from speech, structurally isolating emotional optimization from semantic content. Building upon this structured preference space, HPRO bridges the scale gap by progressively aligning frame-, word-, and sentence-level objectives. Experiments demonstrate that HPRO significantly enhances emotional expressiveness while effectively preserving linguistic intelligibility. The code and audio samples are publicly available at https://xxh333.github.io/hpro-demo/.
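The scale gap can be made concrete with a small sketch. The Python below is not the authors' implementation; it only illustrates one plausible way to turn rewards at three granularities into a dense per-frame signal. The function names, the word-span format, and the fine-to-coarse linear schedule are all assumptions introduced here for illustration, since the abstract does not specify how the levels are combined or in what order.

```python
from typing import List, Sequence, Tuple


def progressive_weights(step: int, total_steps: int) -> Tuple[float, float, float]:
    """Hypothetical schedule: weight moves from frame to word to sentence level.

    Returns (frame_weight, word_weight, sentence_weight), which sum to 1.
    """
    t = step / max(total_steps - 1, 1)      # training progress in [0, 1]
    frame_w = max(0.0, 1.0 - 2.0 * t)       # active in the first half
    sentence_w = max(0.0, 2.0 * t - 1.0)    # active in the second half
    word_w = 1.0 - frame_w - sentence_w     # peaks at the midpoint
    return frame_w, word_w, sentence_w


def blended_frame_credit(
    frame_rewards: Sequence[float],
    word_spans: Sequence[Tuple[int, int]],
    sentence_reward: float,
    step: int,
    total_steps: int,
) -> List[float]:
    """Blend frame-, word-, and sentence-level rewards into one value per frame.

    word_spans holds (start, end) frame indices per word, end exclusive.
    Frames outside every span (e.g. silence) keep their own frame reward
    as the word-level term; this fallback is an assumption of the sketch.
    """
    word_term = list(frame_rewards)
    for start, end in word_spans:
        pooled = sum(frame_rewards[start:end]) / (end - start)
        for i in range(start, end):
            word_term[i] = pooled           # broadcast the word mean to its frames

    frame_w, word_w, sentence_w = progressive_weights(step, total_steps)
    return [
        frame_w * f + word_w * w + sentence_w * sentence_reward
        for f, w in zip(frame_rewards, word_term)
    ]
```

Under this assumed schedule, early steps use only the dense frame-level rewards, the midpoint uses word-pooled rewards, and the final step broadcasts the single sentence-level score to every frame, so the sparse signal is introduced only after the finer levels have been aligned.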