Self-supervised Learning Matters: A Simple Ensemble Solution for Micro-Gesture Recognition

2026-06-08 · Computer Vision and Pattern Recognition

AI summary

The authors present their winning solution to a competition on recognizing micro-gestures, which are subtle and brief body movements. They combined several video analysis models, including one that was first pretrained on 120K unlabeled video clips and then fine-tuned on the specific task. On its own, this self-supervised model reached 69.224% top-1 accuracy on the iMiGUE test set; combined with the other models, it raised the ensemble to 74.419%, the best overall result. Their experiments show that learning from unlabeled video data can help computers better understand subtle gestures.

micro-gesture classification, self-supervised learning, RGB video modeling, masked video modeling, ensemble methods, iMiGUE dataset, top-1 accuracy, transfer learning, multi-stream models, IJCAI MiGA Challenge
Authors
Tingyi Liu, Kun Li, Fei Wang, Junjie Chen, Zhiliang Wu, Jihao Gu, Haixu Liu, Dan Guo
Abstract
In this paper, we present XInsight Lab's solution to the micro-gesture classification track of the 4th MiGA Challenge at IJCAI 2026, in which our solution ranked first and achieved a new state-of-the-art result. We propose a multimodal ensemble framework that integrates a self-supervised RGB-based model with supervised multi-stream models from previous solutions. The self-supervised RGB model is pretrained on 120K unlabeled clips via masked video modeling and then fine-tuned on iMiGUE. This simple yet effective RGB baseline achieves 69.224% top-1 accuracy on the iMiGUE test set, demonstrating the benefit of learning transferable representations from unlabeled in-domain videos. By incorporating this model as a complementary branch, the final ensemble reaches 74.419% top-1 accuracy, surpassing the previous state of the art by 1.206 percentage points. Experimental results on iMiGUE, including ablation studies on the ensemble strategy, validate the effectiveness of self-supervised RGB representation learning for micro-gesture recognition.
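The abstract does not state how the branches are fused, so the following is only a minimal sketch of one common choice: late fusion by weighted averaging of per-branch softmax scores. The function names (`softmax`, `ensemble_predict`, `top1_accuracy`), the branch weights, and the toy logits are all hypothetical and are not taken from the paper. The last lines simply redo the arithmetic implied by the reported numbers (74.419% minus a 1.206-point margin).

```python
import math


def softmax(logits):
    """Convert raw class scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(branch_logits, weights):
    """Weighted average of per-branch softmax scores (assumed fusion rule).

    branch_logits: one list of class logits per branch, all the same length.
    weights: one non-negative weight per branch (hypothetical values).
    Returns (predicted class index, fused probability vector).
    """
    if len(branch_logits) != len(weights):
        raise ValueError("need exactly one weight per branch")
    total_w = sum(weights)
    num_classes = len(branch_logits[0])
    fused = [0.0] * num_classes
    for logits, w in zip(branch_logits, weights):
        probs = softmax(logits)
        for c in range(num_classes):
            fused[c] += (w / total_w) * probs[c]
    pred = max(range(num_classes), key=fused.__getitem__)
    return pred, fused


def top1_accuracy(preds, labels):
    """Fraction of samples whose top-scoring class equals the label."""
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    return correct / len(labels)


# Toy example: two branches, three classes (made-up logits, not real data).
rgb_logits = [2.0, 1.0, 0.0]       # stands in for the self-supervised RGB branch
other_logits = [0.0, 3.0, 0.0]     # stands in for a supervised multi-stream branch
pred, fused = ensemble_predict([rgb_logits, other_logits], weights=[1.0, 1.0])

# Previous state of the art implied by the abstract's reported margin.
previous_sota = round(74.419 - 1.206, 3)
```

In this toy case the second branch is more confident than the first, so the fused prediction follows it. In practice the branch weights would be tuned on a validation split, which is presumably part of what the ensemble ablations explore.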