LLM-based Multimodal Personality Recognition via Facial Action Unit-Text Semantic Fusion

2026-06-29 | Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors focus on improving personality detection from video interviews by combining facial expressions with what people say. They turn face movements into readable text and combine this with the person's spoken answers using a language model. This method keeps important facial and speech details, making predictions more accurate and easier to understand. Tests show their approach works better than most past methods by using both words and face cues together.

Personality recognition, Asynchronous video interviews, Large language models, Facial action units, Multimodal fusion, Semantic representation, Regression, Temporal dynamics, Interpretable embeddings, AVI-6 benchmark
Authors
Tianyi Zhang, Wei Shan, Yuan Zong, Tianhua Qi, Wenming Zheng
Abstract
Personality recognition in asynchronous video interviews (AVIs) has become increasingly important due to their widespread adoption in modern recruitment. Existing approaches often rely on large language models (LLMs) to analyze the textual responses of interviewees in AVIs. However, unimodal methods often suffer from information loss (e.g., ignoring facial cues). In contrast, multimodal methods that employ full-face images or sparsely sampled frames can discard fine-grained temporal dynamics critical for accurate personality assessment. To overcome these limitations, we propose an LLM-based framework that semantically fuses facial action units (AUs) with textual responses in AVIs. AU sequences are first converted into interpretable textual descriptions, which are then fused with participants' textual responses through an LLM. A lightweight regression head transforms the resulting embeddings into continuous personality scores without disrupting the underlying semantic space. Experiments on the AVI-6 benchmark demonstrate consistent improvements over most baselines, with lower prediction errors and stronger correlations with human-rated scores across multiple traits. Further analysis reveals that AU-derived semantic representations offer non-verbal cues that complement the textual responses. Decoupling semantic understanding from regression prediction within the LLM also leads to greater training stability and clearer interpretability. Overall, these findings demonstrate that AU-text fusion provides a psychologically grounded and computationally efficient framework for personality recognition in AVIs.
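
As an illustration of the AU-to-text step described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes per-frame AU intensities are already available as dictionaries (for example from an off-the-shelf AU detector), and it summarises each AU as the time spans in which its intensity exceeds a threshold. The function names (`active_runs`, `describe_au_sequence`, `build_prompt`), the threshold value, the sentence template, and the prompt layout are all hypothetical; the abstract does not specify them. The AU labels are the standard FACS names.

```python
from typing import Dict, List, Sequence, Tuple

# Standard FACS names for a handful of action units (subset chosen for illustration).
AU_NAMES: Dict[int, str] = {
    1: "inner brow raiser",
    2: "outer brow raiser",
    4: "brow lowerer",
    6: "cheek raiser",
    12: "lip corner puller",
    15: "lip corner depressor",
}


def active_runs(values: Sequence[float], threshold: float) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame ranges where values >= threshold."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, v in enumerate(values):
        if v >= threshold and start is None:
            start = i                      # a new active run begins
        elif v < threshold and start is not None:
            runs.append((start, i))        # the run ended at the previous frame
            start = None
    if start is not None:                  # run still open at the end of the clip
        runs.append((start, len(values)))
    return runs


def describe_au_sequence(
    frames: Sequence[Dict[int, float]],
    threshold: float = 1.0,                # assumed intensity cut-off (hypothetical)
    fps: float = 1.0,                      # frames per second of the AU track
) -> str:
    """Turn per-frame AU intensities into a short, readable description."""
    sentences: List[str] = []
    for au, name in sorted(AU_NAMES.items()):
        values = [frame.get(au, 0.0) for frame in frames]
        for start, end in active_runs(values, threshold):
            sentences.append(
                f"AU{au} ({name}) active from {start / fps:.1f}s to {end / fps:.1f}s."
            )
    return " ".join(sentences) if sentences else "No action units exceeded the threshold."


def build_prompt(au_description: str, answer: str) -> str:
    """Combine the AU description and the transcript into one LLM input (assumed layout)."""
    return f"Facial behaviour: {au_description}\nInterview answer: {answer}"


# Toy example: four frames at 1 fps with hand-written intensities.
example_frames = [{12: 2.0}, {12: 2.5, 4: 0.2}, {12: 0.1}, {4: 1.5}]
example_prompt = build_prompt(
    describe_au_sequence(example_frames),
    "I enjoy working in teams.",
)
print(example_prompt)
```

Because every frame contributes to the run boundaries, a description of this kind retains timing information that sparse frame sampling would drop, while staying in plain text that an LLM can read alongside the transcript. In the framework described above, the embedding of such a combined input would then be passed to the regression head.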