MERIT: Learning Disentangled Music Representations for Audio Similarity

2026-05-26 • Sound

AI summary

The authors present MERIT, a system that represents music along three separate dimensions (melody, rhythm, and timbre) rather than producing a single overall similarity score. Training uses generated audio and source-separated stems so that examples vary in one dimension at a time, which encourages each representation to learn its own dimension independently. In their evaluations, each representation responds to its intended dimension and stays near chance on the others, on both synthetic and real-world audio. This supports more fine-grained music search and comparison.

music similarity, melody, rhythm, timbre, disentangled representation, audio source separation, conditional audio generation, machine learning, music information retrieval, factor-specific model

Authors

Abhinaba Roy, Junyi Liang, Dorien Herremans

Abstract

Current music similarity models typically compute a single, monolithic score, entangling distinct musical dimensions like melody, rhythm, and timbre. This limits user control and interpretability, making it impossible to execute nuanced queries. We introduce MERIT, a framework for learning disentangled, factor-specific music representations tailored to these three core dimensions. To overcome the lack of isolated musical variations in real-world audio, we propose a novel training strategy that uses conditional audio generation and source-separated stems to strongly encourage single-factor variation in the training data. Our evaluations demonstrate strong factor-wise disentanglement: each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds across both the synthetic training domain and independent real-world audio.
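To illustrate what a factor-specific query could look like in practice, here is a minimal sketch in Python. It is not the authors' implementation: it assumes each track has already been mapped to one embedding per factor (the abstract does not specify the model architecture, embedding size, or distance measure), and it uses cosine similarity as a stand-in. The names `FACTORS`, `cosine`, `factor_similarity`, and `weighted_score` are hypothetical, and the embeddings are toy values chosen by hand.

```python
import math

# Hypothetical factor names, taken from the three dimensions in the abstract.
FACTORS = ("melody", "rhythm", "timbre")


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def factor_similarity(track_a, track_b):
    """Return one similarity score per factor instead of a single monolithic score.

    Each track is assumed to be a dict mapping a factor name to that factor's embedding.
    """
    return {f: cosine(track_a[f], track_b[f]) for f in FACTORS}


def weighted_score(scores, weights):
    """Combine per-factor scores with user-chosen weights (an assumed query mechanism)."""
    total = sum(weights.get(f, 0.0) for f in FACTORS)
    if total == 0.0:
        raise ValueError("at least one factor weight must be non-zero")
    return sum(weights.get(f, 0.0) * scores[f] for f in FACTORS) / total


# Toy embeddings: same melody and timbre, different rhythm.
query = {"melody": [1.0, 0.0], "rhythm": [0.0, 1.0], "timbre": [1.0, 1.0]}
candidate = {"melody": [1.0, 0.0], "rhythm": [1.0, 0.0], "timbre": [1.0, 1.0]}

scores = factor_similarity(query, candidate)
melody_only = weighted_score(scores, {"melody": 1.0})
```

With separate scores per factor, a query such as "same melody, any rhythm" reduces to choosing weights, which a single entangled score cannot express.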
