Affective Music Recommendation: A Rollout-Based World Model for Offline Preference Optimization

2026-05-27 • Machine Learning

Machine Learning, Information Retrieval, Sound

AI summary

The authors built a music recommendation system, AMRS, designed to improve listeners' emotional states. It serves clinical users, mainly older adults with neurocognitive conditions, as well as consumer-wellness users. Because experimenting on emotions with live users is ethically constrained, the system relies on a world model, trained on logged listening data, that predicts how listeners will engage and feel. This model is used to train and stress-test recommendation policies offline, before deployment. The optimized policy improves predicted valence and arousal over a behaviour-cloned baseline while keeping a similar level of music diversity, which suggests a way to recommend music for emotional benefit when online experimentation is not possible.

affective music recommendation, causal transformer, engagement prediction, valence and arousal, direct preference optimization, world model, offline policy training, behaviour cloning, cold-start problem, distributional collapse

Authors

Audrey Chan, Aaron Labbé, Jacob Lavoie, Jordan Bannister, Arsène Fansi Tchango, Guillaume Lajoie, Laurent Charlin

Abstract

Functional music applications, from consumer focus and sleep aids to clinical interventions, share a distinctive recommendation problem: success is defined by the listener's affective state, but online experimentation on emotion is ethically constrained, particularly for clinical populations who cannot reliably skip a song or report distress. We describe AMRS, the Affective Music Recommendation System deployed on LUCID's health-and-wellness platforms, which serve clinical users (primarily older adults with neurocognitive conditions) and consumer-wellness users across energize, focus, calm, and sleep modes. AMRS is built around a rollout-based world model: a causal transformer trained on logged listening data to jointly predict engagement, binary rating, and self-reported valence and arousal. The world model serves both as an in-silico simulator for offline policy training and as a stress-testing tool before deployment. A recommender policy initialized by behaviour cloning is fine-tuned offline with Direct Preference Optimization (DPO) against a configurable multi-objective utility function. Under a strict cold-start protocol, the world model predicts both behavioural and affective signals with usable fidelity; DPO improves predicted valence and arousal over the cloned baseline while maintaining a similar diversity profile and avoiding the distributional collapse produced by greedy optimization. We position the work as an early deployed validation of a methodology for affective recommendation when online experimentation is ethically untenable.
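As a rough illustration of the offline fine-tuning step described in the abstract, the sketch below scores two candidate tracks with a weighted multi-objective utility over world-model predictions, orders them into a (chosen, rejected) pair, and evaluates the standard DPO loss for that pair. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `WEIGHTS`, `utility`, `preference_pair`, and `dpo_loss`, the equal weights, the prediction values, and `beta=0.1` are all hypothetical. In practice the weights would presumably differ by mode (for example, lower arousal may be preferred in calm or sleep modes).

```python
import math

# Hypothetical weights for the configurable multi-objective utility.
# The abstract does not give the actual form or values.
WEIGHTS = {"engagement": 0.25, "rating": 0.25, "valence": 0.25, "arousal": 0.25}


def utility(pred, weights=WEIGHTS):
    """Weighted sum of world-model predictions for one candidate track."""
    return sum(weights[key] * pred[key] for key in weights)


def preference_pair(cand_a, cand_b, weights=WEIGHTS):
    """Return (chosen, rejected), ordered by predicted utility."""
    if utility(cand_a["pred"], weights) >= utility(cand_b["pred"], weights):
        return cand_a, cand_b
    return cand_b, cand_a


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(margin)).

    logp_* are log-probabilities under the policy being trained;
    ref_logp_* are log-probabilities under the frozen reference policy
    (here, the behaviour-cloned policy is assumed to play that role).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Illustrative predictions only; these are not results from the paper.
track_a = {"id": "a", "pred": {"engagement": 0.9, "rating": 1.0, "valence": 0.2, "arousal": 0.1}}
track_b = {"id": "b", "pred": {"engagement": 0.6, "rating": 1.0, "valence": 0.8, "arousal": 0.5}}
chosen, rejected = preference_pair(track_a, track_b)
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2; raising the policy's log-probability of the chosen track relative to the rejected one lowers the loss.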
