Explicit Representation Alignment for Multimodal Sentiment Analysis

2026-06-08 | Computation and Language

AI summary

The authors study how to better recognize sentiment and emotion by combining information from text and images. They identify a key problem: text and image encoders that are pretrained separately produce representations that do not line up, which makes combining them difficult. To fix this, they use vision-language models to turn images into text descriptions so that both types of data share the same "language." They also add token selection and a regularization term that filter out noise in the generated descriptions and keep the representations well spread and stable. Their method outperforms strong text-only and multimodal baselines on several sentiment and emotion recognition benchmarks.

multimodal affective analysis, vision-language models (VLMs), representation alignment, fusion strategies, semantic token selection, uniformity regularization, sentiment analysis, emotion recognition, text-image fusion
Authors
Baode Wang, Ziming Wang, Huacan Wang, Ronghao Chen, Biao Wu
Abstract
Multimodal affective analysis aims to understand human sentiment and emotion by jointly modeling heterogeneous modalities such as text and images. However, multimodal models often fail to consistently outperform strong text-only baselines, with performance varying significantly across fusion strategies. In this work, we identify representation misalignment between independently pretrained modality encoders as a key bottleneck for effective multimodal learning, and show through controlled experiments that alignment prior to fusion is often more important than fusion complexity. To address this issue, we propose a unified multimodal affective analysis framework that leverages vision-language models (VLMs) to convert visual content into structured textual descriptions, projecting heterogeneous modalities into a shared linguistic space and enabling interpretable text-centric reasoning. To further improve robustness, we introduce a hybrid learning strategy that combines semantic token selection with a batch-level uniformity regularization objective, encouraging a more dispersed and stable global feature space while mitigating noise introduced by VLM-generated descriptions. Experiments on multiple multimodal sentiment and emotion benchmarks show that our method consistently outperforms strong unimodal and multimodal baselines, achieving state-of-the-art performance. Our analysis further highlights the critical role of representation alignment in multimodal affective learning.
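
The abstract does not give the form of the batch-level uniformity regularization objective. As a rough illustration only, the sketch below uses one widely used formulation, the log of the mean Gaussian potential over all pairs of L2-normalized features in a batch. The function names, the temperature `t`, and the toy feature vectors are assumptions made for this example and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so only its direction matters."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def uniformity_loss(features, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs in a batch.

    Assumed formulation (not confirmed by the source): lower values mean the
    normalized features are more spread out on the unit hypersphere.
    `t` is a hypothetical temperature hyperparameter.
    """
    if len(features) < 2:
        raise ValueError("need at least two feature vectors")
    z = [l2_normalize(f) for f in features]
    total = 0.0
    pairs = 0
    for i in range(len(z)):
        for j in range(i + 1, len(z)):
            sq_dist = sum((a - b) ** 2 for a, b in zip(z[i], z[j]))
            total += math.exp(-t * sq_dist)
            pairs += 1
    return math.log(total / pairs)


# Toy batches: one fully collapsed, one evenly spread on the unit circle.
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
dispersed = [
    [1.0, 0.0],
    [-0.5, math.sqrt(3) / 2],
    [-0.5, -math.sqrt(3) / 2],
]

print(uniformity_loss(collapsed))  # collapsed features give the maximum value
print(uniformity_loss(dispersed))  # dispersed features give a lower value
```

Under this assumed formulation, adding the term to a task loss with a small weight would penalize batches whose features collapse toward a single direction, which matches the abstract's stated goal of a more dispersed and stable global feature space.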