InstrAct: Towards Action-Centric Understanding in Instructional Videos

2026-04-09 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence

AI summary

The authors focus on making computers better at understanding step-by-step actions in instructional videos, which is hard because current models often rely on the objects in a scene instead of the motion. They propose a pretraining framework called InstrAction that filters noisy video captions and teaches the model to attend to actions by generating action-centric hard negatives and extracting motion-relevant visual tokens. Two auxiliary objectives, DTW-Align and Masked Action Modeling, help the model capture the order in which actions happen and link what is seen with language. On the authors' InstrAct Bench, the method outperforms state-of-the-art video foundation models on semantic reasoning, procedural logic, and fine-grained retrieval.

Instructional videos, Video Foundation Models, Contrastive learning, Action-centric representation, Dynamic Time Warping, Masked Action Modeling, Temporal relations, Cross-modal grounding, Semantic reasoning, Procedural logic

Authors

Zhuoyi Yang, Jiapeng Yu, Reuben Tan, Boyang Li, Huijuan Xu

Abstract

Understanding instructional videos requires recognizing fine-grained actions and modeling their temporal relations, which remains challenging for current Video Foundation Models (VFMs). This difficulty stems from noisy web supervision and a pervasive "static bias", where models rely on objects rather than motion cues. To address this, we propose InstrAction, a pretraining framework for learning action-centric representations of instructional videos. We first introduce a data-driven strategy that filters noisy captions and generates action-centric hard negatives to disentangle actions from objects during contrastive learning. At the visual feature level, an Action Perceiver extracts motion-relevant tokens from redundant video encodings. Beyond contrastive learning, we introduce two auxiliary objectives: Dynamic Time Warping alignment (DTW-Align) for modeling sequential temporal structure, and Masked Action Modeling (MAM) for strengthening cross-modal grounding. Finally, we introduce the InstrAct Bench to evaluate action-centric understanding, where our method consistently outperforms state-of-the-art VFMs on semantic reasoning, procedural logic, and fine-grained retrieval tasks.
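
To give a feel for the kind of sequential alignment that DTW-Align builds on, here is a minimal sketch of classic Dynamic Time Warping between a sequence of clip embeddings and a sequence of step-caption embeddings. It is an illustration under assumptions, not the authors' objective: the abstract does not specify the loss, a trainable version would presumably need a differentiable relaxation, and the names `cosine_distance` and `dtw_align` as well as the choice of cosine distance are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dtw_align(clips, steps):
    """Classic DTW (illustrative, not the paper's DTW-Align loss).

    clips, steps: non-empty lists of embedding vectors.
    Returns (total_cost, path), where path is a monotonic list of
    (clip_index, step_index) pairs.
    """
    n, m = len(clips), len(steps)
    inf = float("inf")
    # acc[i][j] = minimal cost of aligning clips[:i] with steps[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(clips[i - 1], steps[j - 1])
            acc[i][j] = d + min(acc[i - 1][j],      # clip advances, step repeats
                                acc[i][j - 1],      # step advances, clip repeats
                                acc[i - 1][j - 1])  # both advance
    # Backtrack from the end to recover the cheapest monotonic path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((acc[i - 1][j - 1], i - 1, j - 1),
                      (acc[i - 1][j], i - 1, j),
                      (acc[i][j - 1], i, j - 1))
    path.reverse()
    return acc[n][m], path
```

For example, with three toy clip embeddings `[[1, 0], [1, 0], [0, 1]]` and two step embeddings `[[1, 0], [0, 1]]`, the first two clips map to the first step and the last clip to the second. The monotonicity constraint is what encodes procedural order: a later clip can never be matched to an earlier step.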
