MeloDISinger: Melody-Aware & Duration-Preserving Singing Voice Editing with Audio Infilling

2026-06-29 • Sound

AI summary

The authors developed MeloDISinger, a tool for changing the sung lyrics in a recording while keeping the original melody, total duration, and unedited regions intact. Its core module, MeloDRP, divides a fixed duration budget among the parts of the new lyrics so that they fit the melody. The system synthesizes the edited regions by audio infilling so that they blend with the surrounding, unchanged audio. In the authors' experiments, it outperformed earlier singing voice editing methods in both objective and subjective evaluations.

singing voice editing, melody-aware, duration control, flow matching, phonetic cues, pseudo-MIDI, cross-attention, audio infilling, WhisperX, large language model

Authors

Yoonjeong Park, Jaekwon Im, Juhan Nam

Abstract

Text-based singing voice editing (SVE) aims to revise sung lyrics while preserving the original melody, total duration, and non-edited regions. In this paper, we propose MeloDISinger, a flow-matching-based SVE model for melody-aware and duration-preserving editing. Its core module, MeloDRP, predicts fixed-budget duration ratios, enabling explicit span-wise duration control. For melody-aware duration allocation, MeloDRP fuses phonetic cues with pseudo-MIDI melodic context through cross-attention, while temporal-overlap supervision encourages soft phoneme-note correspondences. We further use a flow-matching mel decoder for audio infilling to synthesize edited regions while preserving surrounding context. In addition, we introduce a duration-aware edited-lyric generation pipeline using WhisperX and an LLM to construct feasible evaluation scenarios. Experiments demonstrate state-of-the-art performance in both objective and subjective evaluations.
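
The idea of fixed-budget duration ratios can be illustrated with a minimal sketch. The abstract does not say how MeloDRP parameterizes or rounds its ratios, so everything below is an assumption made for illustration: the function name `allocate_frames` is hypothetical, the ratios are taken to be a softmax over per-phoneme scores, and integer frame counts are obtained by largest-remainder rounding. The only property it is meant to show is that the allocated durations always sum exactly to the length of the edited span.

```python
import math


def allocate_frames(scores, total_frames):
    """Split a fixed frame budget across phonemes (illustrative sketch only).

    scores: one unnormalized score per phoneme (assumed form of the
        predictor output; not taken from the paper).
    total_frames: length of the edited span in frames (the fixed budget).
    Returns integer frame counts that sum exactly to total_frames.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one phoneme")
    if total_frames < n:
        raise ValueError("budget too small to give every phoneme one frame")

    # Softmax turns scores into ratios that are positive and sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    ratios = [e / norm for e in exps]

    # Assumption: reserve one frame per phoneme so none disappears,
    # then share the remaining frames in proportion to the ratios.
    spare = total_frames - n
    ideal = [r * spare for r in ratios]
    frames = [1 + math.floor(v) for v in ideal]

    # Largest-remainder rounding: hand the frames lost to flooring
    # to the phonemes with the largest fractional parts.
    leftover = total_frames - sum(frames)
    by_remainder = sorted(
        range(n), key=lambda i: ideal[i] - math.floor(ideal[i]), reverse=True
    )
    for i in by_remainder[:leftover]:
        frames[i] += 1
    return frames


if __name__ == "__main__":
    durations = allocate_frames([2.0, 0.5, -1.0, 0.0], total_frames=120)
    print(durations, sum(durations))
```

Because the ratios are normalized before being scaled by the budget, changing the lyrics changes how the span is divided but not its total length, which is the duration-preserving behavior the abstract describes.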
