MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs
2026-06-29 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary
The authors created MuseBench, a new test to see how well AI models understand art by not just recognizing images or sounds but by figuring out why artists make certain creative choices. They made this test using thousands of questions from expert commentary across movies, visual art, stage performances, and games. When they tested 28 advanced AI models, even the best one got less than half the answers right, showing these models still struggle with deep artistic understanding compared to humans. This means current AI can see and hear art but doesn’t fully grasp the reasons behind artistic decisions.
audiovisual arts, multimodal large language models, artistic understanding, creative intent, benchmark, zero-shot evaluation, cinematic arts, visual arts, stage performing arts, game arts
Authors
Yuxuan Fan, Gyusik Seo, Jing Hao, Jaemin Cho, Mohit Bansal, Jaehong Yoon
Abstract
Audiovisual arts encompass diverse creative disciplines, including cinema, visual arts, stage performance, and game design, where artistic meaning arises from deliberate combinations of visual, auditory, and narrative elements (e.g., fear amplified through claustrophobic framing, or grief conveyed through silence and lingering close-ups). True artistic understanding extends beyond recognizing what is depicted to reasoning about why it is expressed through particular creative choices. Despite the strong progress of multimodal large language models (MLLMs), this critical aspect of artistic understanding remains underexplored, as existing benchmarks largely measure perceptual recognition while overlooking reasoning about creative intent. To address this gap, we introduce MuseBench, a comprehensive benchmark designed to evaluate MLLMs on nuanced artistic understanding. It comprises 4,016 questions spanning cinematic arts, static visual arts, stage performing arts, and game arts, distilled from over 10K candidate video essays that pair professional commentary with visual demonstration. To capture the open-ended nature of artistic analysis at scale, the benchmark combines single-select and variable-option multi-select questions. All questions are generated and refined through a four-phase iterative pipeline combining shortcut filtering, adversarial distractors, and expert validation. Comprehensive zero-shot evaluation of 28 state-of-the-art MLLMs reveals that even the best-performing model achieves only 48.29% accuracy, substantially below human expert performance of 87.18%, exposing a significant gap in current models' creative domain expertise.
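The abstract reports accuracy over a mix of single-select and variable-option multi-select questions but does not state the scoring rule. As an illustration only, the sketch below assumes exact-set matching (a multi-select answer counts as correct only if the predicted option set equals the reference set, with no partial credit). The `Question` class, the field names, and the sample data are hypothetical and not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """Hypothetical record: one benchmark item and its reference answer."""
    qid: str
    answer: frozenset  # correct option labels; a single label for single-select


def is_correct(question, predicted):
    """Assumed rule: the predicted label set must equal the reference set."""
    return frozenset(predicted) == question.answer


def accuracy(questions, predictions):
    """Percentage of exactly correct items; a missing prediction counts as wrong."""
    if not questions:
        return 0.0
    hits = sum(is_correct(q, predictions.get(q.qid, ())) for q in questions)
    return 100.0 * hits / len(questions)


# Made-up sample data, for illustration only.
questions = [
    Question("q1", frozenset({"B"})),            # single-select
    Question("q2", frozenset({"A", "C"})),       # multi-select, two correct options
    Question("q3", frozenset({"A", "B", "D"})),  # multi-select, three correct options
    Question("q4", frozenset({"D"})),            # single-select
]
predictions = {
    "q1": {"B"},            # exact match
    "q2": {"A"},            # partial selection, wrong under exact match
    "q3": {"A", "B", "D"},  # exact match
    # q4 has no prediction and is counted as wrong
}

print(f"{accuracy(questions, predictions):.2f}%")  # 50.00%
```

Under this assumed rule, multi-select items are strictly harder to guess than single-select ones, because the number of correct options is not fixed. On the reported figures, the best model trails human experts by 87.18 - 48.29 = 38.89 percentage points.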