GRACE: Boosting Video MLLMs with Grounded Action-Centric Evidence for Viewer Sentiment Prediction
2026-06-15 • Computer Vision and Pattern Recognition
AI summary
The authors aim to predict how viewers feel when watching video ads, an emotional response that is never shown directly on screen. They observe that existing models process whole video frames and therefore leave small but important actions and interactions implicit. Their method breaks each video down into time-ordered descriptions of who does what to whom, adds any text visible on screen, and pairs these descriptions with cropped images of the people or objects involved. This gives the model explicit, detailed clues to reason from when inferring emotion. Experiments on the Pitts dataset show consistent improvements over Qwen2.5-VL and Qwen3-VL baselines, and evaluations on AdsQA and an emotion-focused TVQA subset further support the approach.
viewer sentiment prediction, video advertisements, Multimodal Large Language Models, subject-verb-object triplets, visual entity crops, emotional reasoning, action-centric descriptions, Pitts dataset, AdsQA, TVQA
Authors
Ruoxuan Yang, Tieyuan Chen, Xiaofeng Huang, Haibing Yin, Jun Wang, Xiping Chen, Jun Yin, Xuesong Gao, Weiyao Lin
Abstract
Viewer sentiment prediction in video advertisements aims to infer the latent affective response evoked in the audience. To bridge the gap between what is shown and what is felt, models must deduce hidden viewer emotions from explicit visual narratives, concrete character-object interactions, and visible textual cues. However, standard Multimodal Large Language Models (MLLMs) typically rely on holistic frame representations, which leave these fine-grained, affect-relevant events implicit and complicate precise emotional reasoning. To address this, we propose a grounded action-centric evidence augmentation framework that enhances video MLLMs' clue extraction and comprehension by introducing explicit event structure and localized visual evidence. Our method extracts temporally ordered subject-verb-object (SVO) triplets and auxiliary visible textual cues from action-centric video descriptions, grounds subject and object entities as visual entity crops, and then enables the MLLM to perform clue-enhanced emotional reasoning based on these extracted structured clues. In this way, action triplets specify "what happens", while grounded visual entity crops anchor "who or what participates in each event" to concrete visual evidence. Experiments on the Pitts dataset show consistent improvements over Qwen2.5-VL and Qwen3-VL baselines. Ablation studies, cross-dataset evaluation on AdsQA, and transfer experiments on an emotion-focused TVQA subset further support the effectiveness and generalization of our approach.
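The clue structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `SVOEvent` record, the `clamp_box` and `build_clue_prompt` helpers, the `(x1, y1, x2, y2)` pixel box convention, and the prompt layout are all assumptions made for illustration. The sketch only shows how temporally ordered SVO triplets, visible text cues, and entity crop regions might be organized before being passed to a video MLLM.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical box convention: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[int, int, int, int]


@dataclass
class SVOEvent:
    """One action-centric event: who (subject) does what (verb) to whom (obj)."""
    time_s: float                      # timestamp of the event in seconds
    subject: str
    verb: str
    obj: str
    subject_box: Optional[Box] = None  # region to crop for the subject, if grounded
    object_box: Optional[Box] = None   # region to crop for the object, if grounded


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a box to the frame so it can be used safely as a crop region."""
    x1, y1, x2, y2 = box
    x1, x2 = sorted((max(0, min(x1, width)), max(0, min(x2, width))))
    y1, y2 = sorted((max(0, min(y1, height)), max(0, min(y2, height))))
    return (x1, y1, x2, y2)


def build_clue_prompt(events: List[SVOEvent], text_cues: List[str], question: str) -> str:
    """Serialize events in temporal order, then visible text, then the question."""
    ordered = sorted(events, key=lambda e: e.time_s)
    lines = ["Events (in temporal order):"]
    for i, e in enumerate(ordered, start=1):
        lines.append(f"{i}. [{e.time_s:.1f}s] {e.subject} -> {e.verb} -> {e.obj}")
    if text_cues:
        lines.append("Visible text: " + "; ".join(text_cues))
    lines.append("Question: " + question)
    return "\n".join(lines)
```

In this sketch the triplets carry "what happens", while the clamped boxes mark where the crops that anchor "who or what participates" would be taken from each frame. How the crops and the serialized clues are actually fed to the MLLM is not specified in the abstract and is left out here.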