Joycent: Diffusion-based Accent TTS without Accented Phone Prediction

2026-06-15

Sound
AI summary

The authors present Joycent, a diffusion-based text-to-speech system that produces accented speech directly from standard phone sequences and a speech reference, without first converting standard phones into accented ones. Unlike two-stage pipelines, it combines accent and speaker representations through conditional layer normalization in the text encoder, which avoids the error accumulation of multiple steps and the need for paired standard-accented phone data. They also introduce WhisAID, a Mandarin accent identification model used to extract accent representations. Experiments show that Joycent improves accentedness over baseline systems while preserving speaker identity. The authors have released their code and demos online.

text-to-speech, accent synthesis, diffusion model, conditional layer normalization, phonemes, prosody, speech representation, Mandarin accent, accent identification
Authors
Xintong Wang, Ye Wang
Abstract
Accent text-to-speech (TTS) aims to synthesize speech with target accents. Existing accent TTS systems typically rely on a two-stage pipeline that first converts standard phone sequences into accented phone sequences and then synthesizes accented speech. However, such approaches suffer from error accumulation and require paired standard-accented phone sequence data, which is often limited in practice. Moreover, text-based accented phone representations are insufficient to model acoustic accent characteristics such as prosody and rhythm. In this work, we propose Joycent, a diffusion-based accent TTS model that synthesizes accented speech directly from standard phone sequences and speech references without accented phone prediction. Joycent integrates accent and speaker representations through conditional layer normalization (CLN) in the text encoder. We introduce WhisAID, a Mandarin accent identification model trained on accented Mandarin speech to extract accent representations. Experimental results show that Joycent improves accentedness while preserving speaker identity compared with baseline systems. We release our code and demos at: https://github.com/oshindow/Joycent-code.
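The conditioning mechanism named in the abstract, conditional layer normalization, can be illustrated with a minimal sketch. This is not the authors' implementation: the class name, the dimensions, the initialization, and the choice to concatenate the accent and speaker embeddings into one condition vector are all assumptions made for illustration. The idea shown is only the general one, that the scale and shift of a layer norm are predicted from a condition vector instead of being fixed learned parameters.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize one hidden vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, cond):
    """Compute W @ cond + b with plain Python lists."""
    return [sum(w * c for w, c in zip(row, cond)) + b
            for row, b in zip(weights, bias)]


class ConditionalLayerNorm:
    """Layer norm whose scale and shift are predicted from a condition vector."""

    def __init__(self, hidden_dim, cond_dim, seed=0):
        rng = random.Random(seed)
        # Hypothetical initialization: small random projections, with the
        # scale bias at 1 and the shift bias at 0, so that a zero condition
        # reduces the layer to plain layer normalization.
        self.w_scale = [[rng.gauss(0, 0.02) for _ in range(cond_dim)]
                        for _ in range(hidden_dim)]
        self.b_scale = [1.0] * hidden_dim
        self.w_shift = [[rng.gauss(0, 0.02) for _ in range(cond_dim)]
                        for _ in range(hidden_dim)]
        self.b_shift = [0.0] * hidden_dim

    def __call__(self, hidden, cond):
        scale = linear(self.w_scale, self.b_scale, cond)
        shift = linear(self.w_shift, self.b_shift, cond)
        normed = layer_norm(hidden)
        return [g * h + b for g, h, b in zip(scale, normed, shift)]


# Assumption: accent and speaker embeddings are concatenated into a single
# condition vector. The values and sizes below are toy placeholders.
accent_emb = [0.3, -0.1]
speaker_emb = [0.5, 0.2, -0.4]
cond = accent_emb + speaker_emb

cln = ConditionalLayerNorm(hidden_dim=4, cond_dim=len(cond))
phone_hidden = [1.0, 2.0, 3.0, 4.0]  # one phone's hidden state in the text encoder
out = cln(phone_hidden, cond)
```

In this sketch the same phone hidden state yields different outputs under different accent or speaker embeddings, which is how a single standard phone sequence could be steered toward a target accent without an intermediate accented phone sequence.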