AI summary
The authors studied whether the emotion-related information learned by Transformer models for text and speech matches Russell's circumplex, a well-known psychological model that organizes emotions along two axes, valence and arousal. They tested a text model (RoBERTa), a speech model (wav2vec 2.0), and a multimodal model that fuses the two, using both naturalistic recordings (MSP-Podcast) and controlled LLM-generated stimuli. Their results show that when text and audio are combined, the ordering of primary emotions in the learned space matches the ordering in Russell's model. They also found that, without any extra training, generic text embeddings place fine-grained emotion words close to their human-mapped positions. The authors argue that the circumplex structure is encoded in the embeddings themselves rather than arising solely from human annotation.
Affective computing, Transformer models, Russell's circumplex model, Valence-arousal, Latent space, RoBERTa, wav2vec 2.0, Multimodal fusion, Zero-shot learning, Emotion representation
Authors
Amdjed Belaref, Samir Sadok, Zineb Noumir, Renaud Seguier
Abstract
Affective computing increasingly relies on deep learning to represent emotions, yet latent spaces often remain opaque, high-dimensional black boxes. This paper investigates whether Transformer embeddings recover the geometric regularities of Russell's circumplex model. We unify two complementary experiments testing the hypothesis that, after models are trained on text and speech, their latent spaces encode a topology consistent with valence-arousal and reproduce human-like neighborhood relations. Specifically, we evaluate deep representations extracted from Transformer-based text (RoBERTa) and speech (wav2vec 2.0) encoders, along with a multimodal Transformer fusion architecture, on naturalistic datasets such as MSP-Podcast and on controlled LLM-generated stimuli. Our analysis reveals that multimodal fusion of text and audio yields perfect topological alignment with Russell's primary emotion ordering. Furthermore, in a zero-shot setting using generic text embeddings, projected fine-grained emotion terms fall close to their established human-mapped coordinates. Our contribution is a novel, data-driven framework for validating emotion models. It demonstrates that Russell's circumplex structure is intrinsically encoded in the embeddings of these modalities rather than being solely an artifact of human labeling, thereby bridging the gap between psychological theory and representation learning.
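As a rough illustration of what "topological alignment with an emotion ordering" can mean in practice, the sketch below checks whether class centroids in a 2D projection follow a reference circular order. It is not the authors' evaluation procedure, which the abstract does not specify. The reference order, the four labels, the toy coordinates, and the function names `angular_order` and `same_circular_order` are all assumptions made for this example.

```python
import math

# Assumed reference ordering, counter-clockwise in a valence (x) / arousal (y)
# plane. Illustrative only; the paper's exact label set is not given here.
REFERENCE_ORDER = ["happy", "angry", "sad", "calm"]


def angular_order(coords):
    """Return labels sorted by their angle around the centroid of all points."""
    cx = sum(x for x, _ in coords.values()) / len(coords)
    cy = sum(y for _, y in coords.values()) / len(coords)
    angle = {
        label: math.atan2(y - cy, x - cx) % (2 * math.pi)
        for label, (x, y) in coords.items()
    }
    return sorted(coords, key=angle.get)


def same_circular_order(observed, reference, allow_reflection=True):
    """True if `observed` is a rotation (optionally mirrored) of `reference`."""
    if sorted(observed) != sorted(reference):
        return False
    n = len(reference)
    candidates = [reference]
    if allow_reflection:
        # A 2D projection has an arbitrary handedness, so accept mirror images.
        candidates.append(reference[::-1])
    for cand in candidates:
        start = cand.index(observed[0])
        if all(observed[i] == cand[(start + i) % n] for i in range(n)):
            return True
    return False


# Made-up centroids standing in for a 2D projection of embeddings
# (for example after PCA). These are NOT results from the paper.
toy_centroids = {
    "happy": (0.8, 0.5),
    "angry": (-0.6, 0.7),
    "sad": (-0.7, -0.5),
    "calm": (0.6, -0.6),
}

order = angular_order(toy_centroids)
aligned = same_circular_order(order, REFERENCE_ORDER)
print(order, aligned)
```

Comparing orders up to rotation and reflection is a deliberate choice in this sketch: a learned projection has no canonical orientation, so only the cyclic sequence of neighbors is meaningful.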