YingMusic-Singer: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance
2026-03-25 • Sound
AI summary
The authors created YingMusic-Singer, a model that can change the words in a singing voice while keeping the tune the same, with no manual alignment required. Their method is built on diffusion models and can work with or without a sample of the target voice. They trained it with curriculum learning and reinforcement-based fine-tuning, and showed better results than the comparable Vevo2 model at keeping the melody and matching the new lyrics. They also released a new test set, LyricEditBench, to measure how well models do this task.
singing voice synthesis · diffusion model · melody control · lyric manipulation · timbre reference · curriculum learning · Group Relative Policy Optimization · manual alignment · benchmark · Vevo2
Authors
Chunbo Hao, Junjie Zheng, Guobin Ma, Yuepeng Jiang, Huakang Chen, Wenjie Tian, Gongyu Chen, Zihao Chen, Lei Xie
Abstract
Regenerating singing voices with altered lyrics while preserving melody consistency remains challenging, as existing methods either offer limited controllability or require laborious manual alignment. We propose YingMusic-Singer, a fully diffusion-based model enabling melody-controllable singing voice synthesis with flexible lyric manipulation. The model takes three inputs: an optional timbre reference, a melody-providing singing clip, and modified lyrics, without manual alignment. Trained with curriculum learning and Group Relative Policy Optimization, YingMusic-Singer achieves stronger melody preservation and lyric adherence than Vevo2, the most comparable baseline supporting melody control without manual alignment. We also introduce LyricEditBench, the first benchmark for melody-preserving lyric modification evaluation. The code, weights, benchmark, and demos are publicly available at https://github.com/ASLP-lab/YingMusic-Singer.
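To make the three-input interface described in the abstract concrete, here is a minimal usage sketch. The package name, the `YingMusicSinger` class, the `from_pretrained` loader, and the `synthesize` signature are all hypothetical illustrations, not the actual API of the released repository; consult the linked GitHub page for the real entry points.

```python
# Hypothetical usage sketch of the three-input design from the abstract.
# Everything imported from `yingmusic_singer` is an assumed name, not the
# released API.
import torchaudio

from yingmusic_singer import YingMusicSinger  # assumed package and class

# Assumed loader for the publicly released weights.
model = YingMusicSinger.from_pretrained("ASLP-lab/YingMusic-Singer")

# The three inputs named in the abstract: a melody-providing singing clip,
# the modified lyrics as free text, and an optional timbre reference.
# Note that no manual phoneme-to-note alignment is supplied.
audio = model.synthesize(
    melody_clip="original_song.wav",       # melody-providing singing clip
    lyrics="rewritten lyric line here",    # modified lyrics
    timbre_reference="target_voice.wav",   # optional; omit to reuse source timbre
)

# Save the regenerated vocal (sample rate assumed for illustration).
torchaudio.save("edited_song.wav", audio, 24000)
```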