NüshuVoice: Reviving the Voice of Endangered Nüshu with Pitch-Aware Text-to-Speech

2026-06-08 · Computation and Language

AI summary

The authors studied Nüshu, an endangered phonetic script historically used by women in Jiangyong County, southern Hunan, China, and built the first text-to-speech benchmark for it. The main obstacle is data: very few recordings exist, and most capture isolated syllables rather than full sentences. To address this, they built a sentence-level dataset that aligns Unicode Nüshu text, phonetic transcriptions, Standard Chinese translations, and archival recordings. They also propose Nüshu-PitchVITS, an F0-conditioned VITS model that uses Nüshu's five-level pitch notation as explicit prosodic guidance. The system outperformed strong TTS baselines in spectral fidelity, pitch reconstruction, and human-rated intelligibility, and the dataset and code are publicly released.

Nüshu, text-to-speech, phonetic script, pitch notation, low-resource language, speech synthesis, VITS, prosody, Unicode, acoustic reconstruction
Authors
Hongkun Yang, Xinhui Yi, Xiyan Zhao, Yibo Meng, Lionel Z. Wang, Lixu Wang, Yaqi Zhang, Ruiqi Chen, Xuanyue Zhao, Lanxin Zhang, Yu Zeng, Weijia Chu, Yiming Ma, Chenyu Liu, Jianghao Lin, Xin Xu
Abstract
Nüshu is an endangered phonetic script historically used by women in Jiangyong County, southern Hunan, China. While existing computational studies of Nüshu mainly focus on textual digitization and visual recognition, the acoustic reconstruction of its authentic pronunciation remains largely unexplored. Building a Nüshu text-to-speech (TTS) system is particularly challenging because available recordings are extremely limited and mostly consist of isolated syllable-level pronunciations rather than natural sentence-level utterances. In this work, we introduce NüshuVoice, the first TTS benchmark for Nüshu. We construct a sentence-level Nüshu text-to-audio dataset that aligns standardized Unicode Nüshu text, phonetic transcriptions, standard Chinese translations, and archival recordings. To synthesize speech under this extreme low-resource setting, we propose Nüshu-PitchVITS, an F0-conditioned VITS framework that leverages Nüshu's five-level pitch notation as an explicit prosodic inductive bias. Experimental results show that Nüshu-PitchVITS outperforms strong TTS baselines in spectral fidelity, pitch reconstruction, and human-rated intelligibility. We publicly release the dataset and code at: https://anonymous.4open.science/r/Nvshu-TTS-2EB6.
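
To make the idea of F0 conditioning from a five-level pitch notation concrete, here is a minimal sketch of one way such a notation could be turned into a frame-level F0 contour. It is not the authors' implementation. The function names `level_to_hz` and `tone_to_f0`, the 120 to 320 Hz speaker range, the log-frequency spacing of the five levels, and the example tone strings are all assumptions made for illustration.

```python
import math


def level_to_hz(level: float, f0_min: float = 120.0, f0_max: float = 320.0) -> float:
    """Map a pitch level in [1, 5] to Hz.

    Assumption: the five levels are evenly spaced in log-frequency between
    a hypothetical speaker floor (f0_min) and ceiling (f0_max).
    """
    if not 1.0 <= level <= 5.0:
        raise ValueError(f"pitch level must be in [1, 5], got {level}")
    frac = (level - 1.0) / 4.0
    log_f0 = math.log(f0_min) + frac * (math.log(f0_max) - math.log(f0_min))
    return math.exp(log_f0)


def tone_to_f0(tone: str, n_frames: int,
               f0_min: float = 120.0, f0_max: float = 320.0) -> list[float]:
    """Expand a tone string such as "44" or "51" into n_frames F0 values.

    Each digit is treated as a pitch target; intermediate frames are
    interpolated linearly between targets on the level scale, which is
    linear in log-F0 under the mapping above.
    """
    targets = [float(ch) for ch in tone]  # raises ValueError on non-digits
    if not targets or n_frames < 1:
        raise ValueError("need at least one pitch target and one frame")
    if len(targets) == 1 or n_frames == 1:
        return [level_to_hz(targets[0], f0_min, f0_max)] * n_frames

    contour = []
    for i in range(n_frames):
        # Position of this frame along the sequence of pitch targets.
        pos = i / (n_frames - 1) * (len(targets) - 1)
        lo = min(int(pos), len(targets) - 2)
        w = pos - lo
        level = (1.0 - w) * targets[lo] + w * targets[lo + 1]
        contour.append(level_to_hz(level, f0_min, f0_max))
    return contour


if __name__ == "__main__":
    # Illustrative tone values only; not taken from the NüshuVoice dataset.
    print([round(f, 1) for f in tone_to_f0("51", 5)])
    print([round(f, 1) for f in tone_to_f0("44", 3)])
```

A contour like this could serve as an auxiliary input or a training target for an F0-conditioned model. How Nüshu-PitchVITS actually encodes and injects the pitch notation is specified in the paper and the released code, not here.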