Decoupled Object-Centric Video Understanding for Generating Robotic Manipulation Commands

2026-06-15 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Robotics

AI summary

The authors developed a way for robots to understand video demonstrations by focusing on the objects that are functionally involved in an action, not just the action itself. They separate recognizing what is happening from identifying which objects matter, combining Temporal Shift Modules for action classification with a new algorithm that selects the task-relevant objects. The selected objects are then classified by Vision-Language Models, which can recognize categories they have not seen before. Tested on a modified Something-Something V2 dataset, the method improves action recognition and command generation over previous task-specific approaches. Overall, the system translates videos into clearer robot commands by handling objects and actions separately.

Temporal Shift Modules, Object Selection Algorithm, Spatio-temporal Action Classification, Vision-Language Models, Zero-shot Generalization, Something-Something V2 Dataset, BLEU-4 Score, METEOR, CIDEr

Authors

Thanh Nguyen Canh, Thanh-Tuan Tran, Haolan Zhang, Ziyan Gao, Xiem HoangVan, Nak Young Chong

Abstract

Translating video demonstrations into executable robot commands remains challenging because existing methods often fail to identify which objects are functionally involved in the demonstrated action. As a result, they may generate commands that are linguistically plausible but operationally ambiguous. We propose an object-centric video understanding framework that decouples action recognition from object identification to generate precise, grammar-free manipulation commands. Our approach integrates Temporal Shift Modules (TSM) for efficient spatio-temporal action classification with a novel Object Selection algorithm that identifies task-relevant objects through trajectory-based role classification, blur detection, and overlap minimization. The selected objects are then processed by Vision-Language Models (VLMs) for robust category recognition and zero-shot generalization. Evaluated on a modified Something-Something V2 dataset, our method achieves 86.79% action classification accuracy and BLEU-4 scores of 0.337 on standard objects and 0.261 on novel objects. These results improve over the strongest task-specific baseline by 80.2% and 143.9%, respectively. Larger gains are observed in METEOR and CIDEr, reaching 157.9% and 171.7% on novel objects. Across all semantic metrics, our approach consistently outperforms task-specific methods and remains competitive with, or surpasses, large general-purpose VLMs while retaining a modular, object-centric design.
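The abstract names Temporal Shift Modules as the action classification backbone but does not describe the operation. As a rough illustration of the general temporal shift idea (not the authors' implementation), the sketch below shifts a fraction of feature channels one frame backward and another fraction one frame forward, with zero padding at the clip boundaries. The function name `temporal_shift`, the `fold_div` parameter, and the simplified input layout (one flat feature vector per frame, no spatial dimensions) are assumptions made for this example.

```python
def temporal_shift(frames, fold_div=8):
    """Illustrative temporal shift over a clip.

    frames: list of T per-frame feature vectors, each a list of C numbers
            (hypothetical simplified layout; real models use T x C x H x W).
    fold_div: 1/fold_div of the channels is shifted in each direction
              (assumed default of 8; the abstract does not state a value).
    """
    num_frames = len(frames)
    channels = len(frames[0])
    fold = channels // fold_div

    # Start from zeros so positions with no temporal neighbor stay zero-padded.
    shifted = [[0.0] * channels for _ in range(num_frames)]

    for t in range(num_frames):
        for c in range(channels):
            if c < fold:
                src = t + 1   # first fold: read from the next frame
            elif c < 2 * fold:
                src = t - 1   # second fold: read from the previous frame
            else:
                src = t       # remaining channels are left unshifted
            if 0 <= src < num_frames:
                shifted[t][c] = frames[src][c]
    return shifted


if __name__ == "__main__":
    clip = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
    for row in temporal_shift(clip, fold_div=4):
        print(row)
```

Because the shift only moves existing values between neighboring frames, it lets a per-frame 2D network mix temporal information without extra multiplications, which is consistent with the abstract's description of TSM as efficient.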
