TuneJury: An Open Metric for Improving Music Generation Preference Alignment

2026-06-15 • Sound

Sound, Artificial Intelligence, Machine Learning, Multimedia

AI summary

The authors present TuneJury, a reward model that takes a text prompt and an audio clip and predicts how strongly listeners would prefer that clip. It is trained on pairwise human preferences drawn from several kinds of public feedback, including arena-style votes, crowdsourced comparisons, and expert ratings. The model performs well on held-out test pairs and remains competitive with prior baselines on out-of-distribution benchmarks. For music generators released after training, a lightweight per-system calibration step adjusts the scores without full retraining. The same frozen model is then used to improve generation in three settings: best-of-N selection at inference time, latent optimization, and expert-iteration post-training.

text-to-music, pairwise comparison, reward model, human preference, score calibration, Bradley-Terry model, latent optimization, inference-time selection, expert iteration

Authors

Yonghyun Kim, Junwon Lee, Haiwen Xia, Yinghao Ma, Junghyun Koo, Koichi Saito, Yuki Mitsufuji, Chris Donahue

Abstract

We introduce TuneJury, an open, instance-level pairwise reward model for text-to-music that predicts a music preference score from a text prompt and an audio clip. The released checkpoint is trained on publicly available human-preference labels covering arena-style (A vs. B) votes, metric-alignment preference pairs, crowdsourced pairwise comparisons, and expert aesthetic ratings. The predicted score margin between two clips is well calibrated on our held-out test split, supporting data filtering via a simple score threshold. TuneJury generalizes to both held-out test pairs and out-of-distribution benchmarks, remaining competitive with prior baselines on the latter. For generators released after training, we introduce anchor calibration, a post-hoc, per-system Bradley-Terry calibration that recovers agreement at substantially better data efficiency than from-scratch retraining. The same frozen reward drives consistent reward-axis gains across three downstream applications: inference-time best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training. TuneJury is available at https://github.com/yonghyunk1m/TuneJury.
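As a rough illustration of how a scalar reward of this kind can be used, the sketch below shows the standard Bradley-Terry link between a score margin and a preference probability, a margin-based filter for preference pairs, and best-of-N selection. It is not taken from the TuneJury repository: the function names, the `reward` callable, and the reading of "score threshold" as a minimum margin between paired clips are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

Clip = TypeVar("Clip")


def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that clip A is preferred over clip B.

    Computed as a logistic function of the score margin, in a form that
    avoids overflow for large negative margins.
    """
    margin = score_a - score_b
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def filter_pairs_by_margin(
    scored_pairs: Sequence[Tuple[float, float]], min_margin: float
) -> List[Tuple[float, float]]:
    """Keep only pairs whose absolute score margin reaches min_margin.

    Assumption: "filtering via a score threshold" is read here as a
    threshold on the margin between the two clips in a pair.
    """
    return [(a, b) for a, b in scored_pairs if abs(a - b) >= min_margin]


def best_of_n(
    prompt: str,
    candidates: Sequence[Clip],
    reward: Callable[[str, Clip], float],
) -> Clip:
    """Return the candidate with the highest reward for the prompt.

    `reward` is a hypothetical stand-in for a frozen reward model that
    maps (prompt, clip) to a scalar score.
    """
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=lambda clip: reward(prompt, clip))


# Toy usage with made-up scores in place of a real reward model.
toy_scores = {"clip_a": 0.2, "clip_b": 1.5, "clip_c": -0.7}
toy_reward = lambda prompt, clip: toy_scores[clip]
chosen = best_of_n("calm piano", list(toy_scores), toy_reward)
```

In the toy usage, `chosen` is the clip with the largest made-up score. With a real model, `reward` would wrap a forward pass over the prompt and the audio.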
