Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection

2026-06-01, Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
AI summary

The authors developed a new method to tell if someone is doing a task wrong by looking at videos recorded from their point of view. Their method uses two models: a big one that checks if small parts of the action are done incorrectly, and a small one that looks at the whole video to see if the parts fit together properly. These two models work together to better catch subtle mistakes, especially when errors are rare or unclear. They also use special training techniques to handle the uneven number of right and wrong actions in the videos.

Egocentric video, Action recognition, Coarse-grained video understanding, Fine-grained action reasoning, CLIP model, Diffusion Contrastive Reconstruction, Qwen3-VL Embedding, Cross-entropy, AUC-oriented learning, Label-aware adjustment
Authors
Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Ruochen Cui, Qingming Huang
Abstract
In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To this end, we propose an Understanding-Enhanced Model Collaboration Method (UE-MCM) that combines efficient coarse-grained video understanding with accurate fine-grained action reasoning. Specifically, UE-MCM contains a small model branch and a large model branch. The large model branch focuses on whether the fine-grained action itself is executed incorrectly, while the small model branch jointly takes the coarse-grained video and fine-grained segment as input to identify actions that may be locally correct but inconsistent with the overall workflow. The small model branch is built on a CLIP4CLIP video encoder initialized from a CLIP model enhanced by Diffusion Contrastive Reconstruction, and the large model branch uses the Qwen3-VL Embedding model to extract high-capacity representations from fine-grained action segments. The small-branch prediction and the large-branch prediction are then adaptively fused by a lightweight collaboration gate. To handle the long-tailed distribution of mistake instances, we optimize the classifiers with complementary objectives, including reweighted cross-entropy, AUC-oriented learning, and label-aware adjustment. The resulting system balances speed and accuracy, making it effective for detecting subtle, rare, and ambiguous mistakes in egocentric instructional videos.
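The abstract gives no formulas for the collaboration gate or the long-tailed objectives, so the following is only a minimal sketch of two of the ideas under stated assumptions. It assumes the gate yields a scalar logit per sample and mixes the two branch probabilities as a convex combination, and that "reweighted cross-entropy" means inverse-frequency class weighting. The names `gated_fusion`, `class_weights`, and `reweighted_bce` are hypothetical and do not come from the report.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(p_small: float, p_large: float, gate_logit: float) -> float:
    """Fuse two mistake probabilities with a scalar gate (assumed form).

    gate_logit is assumed to come from a small learned module; here it is
    simply passed in. g near 1 trusts the large branch, g near 0 the small one.
    """
    g = sigmoid(gate_logit)
    return g * p_large + (1.0 - g) * p_small


def class_weights(labels: list[int]) -> dict[int, float]:
    """Inverse-frequency weights for binary labels (1 = mistake).

    Scaled so that the weights average to 1 over the samples.
    """
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present to compute weights")
    return {1: n / (2.0 * n_pos), 0: n / (2.0 * n_neg)}


def reweighted_bce(probs: list[float], labels: list[int],
                   weights: dict[int, float], eps: float = 1e-7) -> float:
    """Class-weighted binary cross-entropy, averaged over samples."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        loss = -math.log(p) if y == 1 else -math.log(1.0 - p)
        total += weights[y] * loss
    return total / len(labels)


if __name__ == "__main__":
    labels = [0, 0, 0, 1]          # toy long-tailed batch: one mistake in four
    small = [0.1, 0.3, 0.2, 0.4]   # small-branch mistake probabilities
    large = [0.2, 0.1, 0.1, 0.9]   # large-branch mistake probabilities
    gates = [0.0, 0.0, 0.0, 2.0]   # hypothetical per-sample gate logits
    fused = [gated_fusion(s, l, g) for s, l, g in zip(small, large, gates)]
    w = class_weights(labels)
    print("fused:", fused)
    print("weights:", w)
    print("loss:", reweighted_bce(fused, labels, w))
```

With one mistake among four samples, the positive class receives three times the weight of the negative class, so the rare mistake contributes as much to the loss as the three correct actions together. The AUC-oriented and label-aware terms mentioned in the abstract are not sketched here because the abstract does not specify their form.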