Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration

2026-05-11 | Sound

AI summary

The authors address the problem of editing specific parts (stems) of a mixed music track without changing the other parts. They find that existing methods struggle to precisely isolate and change only the target stems due to limitations in how attention mechanisms work. To fix this, they introduce Polyphonia, a system that uses acoustic information to better define boundaries between stems, allowing for more accurate editing. Their tests show that Polyphonia improves the separation of the target stem while keeping the rest of the music intact.

diffusion-based generation, text-to-music synthesis, stem-specific timbre transfer, cross-attention, acoustic prior, semantic features, polyphonic music, zero-shot editing
Authors
Haowen Li, Tianxiang Li, Yi Yang, Boyu Cao, Qi Liu
Abstract
The advancement of diffusion-based text-to-music generation has opened new avenues for zero-shot music editing. However, existing methods fail to achieve stem-specific timbre transfer, which requires altering specific stems while strictly preserving the background accompaniment. This limitation severely hinders practical application, since real-world production requires precise manipulation of components within dense mixtures. Our key finding is that, while vanilla cross-attention captures semantic features of stems, it lacks the spectral resolution to strictly localize targets in dense mixtures, leading to boundary leakage. To address this, we propose Polyphonia, a zero-shot editing framework with Acoustic-Informed Attention Calibration. Rather than relying solely on diffuse semantic attention, Polyphonia leverages a probabilistic acoustic prior to establish coarse boundaries, enabling precise semantic synthesis while preserving non-target stems. For evaluation, we propose PolyEvalPrompts, a standardized prompt set with 1,170 timbre transfer tasks in polyphonic music. Polyphonia achieves a 15.5% increase in target alignment compared to baselines, while maintaining competitive music fidelity and non-target integrity.
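
The abstract does not give the calibration formula, so the following is only an illustrative sketch of the general idea: a probabilistic acoustic prior over time-frequency positions is used to restrict a diffuse cross-attention map to the region likely occupied by the target stem. The function name `calibrate_attention`, the elementwise-product rule, and the renormalization are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def calibrate_attention(attn, prior, eps=1e-8):
    """Hypothetical calibration rule (an assumption, not the paper's method).

    attn:  non-negative cross-attention map for the target-stem token,
           shape (time, frequency).
    prior: probability in [0, 1] that each position belongs to the
           target stem, same shape as attn.
    Returns the attention map weighted by the prior and renormalized
    so that its entries sum to (approximately) one.
    """
    attn = np.asarray(attn, dtype=float)
    prior = np.asarray(prior, dtype=float)
    if attn.shape != prior.shape:
        raise ValueError("attn and prior must have the same shape")
    # Suppress attention wherever the prior says the target is unlikely.
    weighted = attn * prior
    # Renormalize; eps avoids division by zero if the prior is all zeros.
    return weighted / (weighted.sum() + eps)

# Toy example: attention leaks into the second frequency column,
# which the prior marks as non-target.
attn = np.array([[0.4, 0.1],
                 [0.3, 0.2]])
prior = np.array([[1.0, 0.0],
                  [0.5, 0.0]])
calibrated = calibrate_attention(attn, prior)
```

In this toy case the leaked attention in the second column is zeroed out, while the remaining mass is redistributed over positions the prior considers plausible for the target stem.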