CraBERT: Efficient Phoneme Encoder Pre-Training via Cascade Fusion of Subword Representations for Text-to-Speech

2026-06-15

Sound
AI summary

The authors developed CraBERT, a pre-trained phoneme encoder that helps text-to-speech systems produce more natural-sounding speech. Their approach feeds word-level knowledge from an existing language model into a model that works at the level of individual speech sounds, so the new model needs less pre-training to work well. In listening tests, CraBERT was rated about as natural as comparable encoders after roughly one epoch of pre-training, where the baselines used roughly ten. This means CraBERT can help make computer-generated voices sound more natural without requiring as much preparation.

Text-to-Speech (TTS), Phoneme Encoder, Pre-training, BERT, Subword, Phoneme, Cascade-fusion Architecture, Prosody, Mean Opinion Score (MOS)
Authors
Dong Yang, Yuki Saito, Wataru Nakata, Hiroshi Saruwatari
Abstract
This paper introduces CraBERT, a pre-trained phoneme encoder (PPEnc) designed for efficient pre-training in text-to-speech (TTS). CraBERT employs a cascade-fusion architecture and a subword-phoneme alignment algorithm to integrate representations from a pre-trained subword-level BERT into a phoneme-level BERT. This design provides prior word- and sentence-level information, reducing the amount of pre-training required by the phoneme encoder. Subjective listening evaluations show that CraBERT achieves MOS values comparable to existing PPEncs after approximately one epoch of pre-training, whereas the baselines in our comparison are pre-trained for approximately ten epochs. These results demonstrate that CraBERT can efficiently learn representations suitable for improving the perceived naturalness and prosody of synthesized speech.
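
The abstract does not spell out the alignment algorithm or the fusion operation, so the following is only a minimal sketch of the general idea of bringing subword-level representations onto a phoneme sequence. It assumes that each word's subword and phoneme counts are known, mean-pools the subword vectors of each word, broadcasts the pooled vector to every phoneme of that word, and fuses by element-wise addition. The function names, the pooling choice, and the additive fusion are all hypothetical and should not be read as CraBERT's actual method.

```python
from typing import List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def align_subwords_to_phonemes(
    subword_vecs: List[Vector],
    subwords_per_word: List[int],
    phonemes_per_word: List[int],
) -> List[Vector]:
    """Return one subword-derived vector per phoneme (hypothetical scheme).

    Word boundaries are the shared unit: the subword vectors of each word are
    mean-pooled, and the pooled vector is repeated for each of its phonemes.
    """
    if len(subwords_per_word) != len(phonemes_per_word):
        raise ValueError("both count lists must cover the same words")
    if sum(subwords_per_word) != len(subword_vecs):
        raise ValueError("subword counts do not match the number of vectors")
    if any(n < 1 for n in subwords_per_word):
        raise ValueError("every word needs at least one subword")

    aligned: List[Vector] = []
    start = 0
    for n_sub, n_ph in zip(subwords_per_word, phonemes_per_word):
        word_vec = mean_pool(subword_vecs[start:start + n_sub])
        aligned.extend(list(word_vec) for _ in range(n_ph))
        start += n_sub
    return aligned


def fuse(phoneme_vecs: List[Vector], aligned_vecs: List[Vector]) -> List[Vector]:
    """Element-wise addition, assumed here as a stand-in for the fusion step."""
    if len(phoneme_vecs) != len(aligned_vecs):
        raise ValueError("sequences must have the same length")
    return [[p + s for p, s in zip(pv, sv)]
            for pv, sv in zip(phoneme_vecs, aligned_vecs)]


# Toy example: "hello world" as subwords [hel, ##lo, world]
# and phonemes [HH AH L OW] [W ER L D]. All vectors are made up.
subword_vecs = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]]
phoneme_vecs = [[0.5, 0.5] for _ in range(8)]

aligned = align_subwords_to_phonemes(subword_vecs, [2, 1], [4, 4])
fused = fuse(phoneme_vecs, aligned)
```

In this toy setup every phoneme of a word receives the same word-level vector, which is one simple way to give a phoneme-level model access to word-level information before its own layers refine the sequence.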