Reweighting Framewise Attention in Video Transformers for Facial Expression Understanding

2026-06-29 • Computer Vision and Pattern Recognition

AI summary

The authors address the challenge of recognizing subtle facial expressions in videos by improving Vision Transformer (ViT) models, which usually focus only on big, obvious movements. They propose MiRA, a method that adjusts how the model pays attention to different video frames to better detect small, detailed facial changes without adding extra trainable parts. MiRA uses statistics from the model’s attention maps to highlight important frames and facial details, and includes two versions for different efficiency needs. Their experiments show that MiRA helps ViTs perform better on facial expression recognition tasks.

Vision Transformer, Facial Expression Recognition, Self-Attention, Spatio-temporal Dynamics, Attention Redistribution, FlashAttention, Self-Supervised Pretraining, Frame-level Importance, Video Models

Authors

Seongro Yoon, Donghyeon Cho, Jinsun Park, François Brémond

Abstract

Understanding facial expressions in videos requires modeling subtle and localized facial dynamics under unconstrained conditions. Although recent Vision Transformer (ViT)-based video models have shown strong performance through large-scale self-supervised pretraining, their attention mechanisms often emphasize dominant global motions and coarse temporal dynamics, limiting sensitivity to fine-grained facial variations. To address this limitation, we propose MiRA (Marginal-induced Attention Redistribution), a plug-in frame-marginal attention redistribution framework for ViT backbones that enhances spatio-temporal selectivity toward subtle facial dynamics without introducing additional trainable parameters. MiRA derives frame-level confidence and intra-frame concentration statistics from self-attention maps to estimate frame-wise marginal importance and redistribute attention toward spatio-temporally localized facial cues. We first introduce a principled exact mode based on post-softmax attention redistribution. To further improve efficiency, we propose flashLite mode, a lightweight pre-softmax approximation that integrates frame-marginal redistribution into FlashAttention kernels while preserving the effectiveness of the exact formulation. Experimental results on challenging Facial Expression Recognition (FER) benchmarks demonstrate consistent improvements over strong ViT baselines.
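
The abstract does not give MiRA's formulas, so the following is only an illustrative sketch of what a parameter-free, post-softmax frame-marginal redistribution could look like. Everything specific here is an assumption rather than the authors' definition: the function name `redistribute_attention`, taking "frame-level confidence" to be the mean attention mass a frame receives, taking "intra-frame concentration" to be one minus the normalized entropy of attention within a frame, and combining the two by a simple product.

```python
import numpy as np


def redistribute_attention(attn, num_frames, eps=1e-8):
    """Hypothetical post-softmax frame-marginal redistribution (not the paper's formula).

    attn: (Q, T*P) array whose rows are post-softmax attention over T frames
          with P patch tokens each, laid out frame by frame.
    Returns an array of the same shape whose rows again sum to 1.
    """
    num_queries, num_keys = attn.shape
    assert num_keys % num_frames == 0, "keys must split evenly into frames"
    patches = num_keys // num_frames
    a = attn.reshape(num_queries, num_frames, patches)

    # Frame marginal: attention mass each query places on each frame.
    marginal = a.sum(axis=2)                      # (Q, T)

    # ASSUMPTION: frame-level confidence = mean marginal mass over queries.
    confidence = marginal.mean(axis=0)            # (T,)

    # ASSUMPTION: intra-frame concentration = 1 - normalized entropy of the
    # attention distribution inside each frame, averaged over queries.
    within = a / (marginal[:, :, None] + eps)
    entropy = -(within * np.log(within + eps)).sum(axis=2)   # (Q, T)
    max_entropy = np.log(patches) if patches > 1 else 1.0
    concentration = np.clip(1.0 - entropy.mean(axis=0) / max_entropy, 0.0, 1.0)

    # ASSUMPTION: frame-wise importance is the product of the two statistics,
    # rescaled so that the weights average to roughly 1 across frames.
    score = confidence * concentration
    weights = score / (score.sum() + eps) * num_frames       # (T,)

    # Reweight every key by its frame's importance, then renormalize rows.
    out = (a * weights[None, :, None]).reshape(num_queries, num_keys)
    return out / (out.sum(axis=1, keepdims=True) + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(4, 8 * 16))         # 4 queries, 8 frames, 16 patches
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    new_attn = redistribute_attention(attn, num_frames=8)
    print(new_attn.shape, new_attn.sum(axis=1))
```

The sketch uses only statistics of the attention map itself, which is consistent with the abstract's claim of no additional trainable parameters. Multiplying post-softmax weights by a per-frame factor and renormalizing is mathematically the same as adding the log of that factor to the pre-softmax logits; whether flashLite mode exploits this identity is not stated in the abstract.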
