OZ-TAL: Online Zero-Shot Temporal Action Localization

2026-05-11, Computer Vision and Pattern Recognition

AI summary

The authors introduce a new task, Online Zero-shot Temporal Action Localization (OZ-TAL), which aims to identify when an action occurs and what it is in streaming video, even if that action was never seen during training. They propose a training-free method built on off-the-shelf Vision-Language Models, with added mechanisms that enhance visual representations and mitigate the models' inherent biases. They establish new benchmarks and baselines on THUMOS14 and ActivityNet-1.3 and report that their approach substantially outperforms previous methods in both offline and online zero-shot settings. This helps computers detect previously unseen actions in live video streams as soon as they complete.

Online Temporal Action Localization, Zero-shot Learning, Vision-Language Models, Streaming Video Analysis, Temporal Action Detection, Instance-level Understanding, THUMOS14 Dataset, ActivityNet Dataset, Training-free Framework, Visual Representation
Authors
Chaolei Han, Hongsong Wang, Xin Gong, Jie Gui
Abstract
Online Temporal Action Localization (On-TAL) aims to detect the occurrence time and category of actions in untrimmed streaming videos immediately upon their completion. Recent advancements in this field focus on developing more sophisticated frameworks, shifting from the Online Action Detection (OAD)-based aggregation paradigm to instance-level understanding. However, existing approaches are typically trained on specific domains and often exhibit limited generalization capabilities when applied to arbitrary videos, particularly in the presence of previously unseen actions. In this paper, we introduce a new task called Online Zero-shot Temporal Action Localization (OZ-TAL), which aims to detect previously unseen actions in an online fashion. Furthermore, we propose a training-free framework that leverages off-the-shelf Vision-Language Models (VLMs) while introducing additional mechanisms to enhance visual representations and mitigate their inherent biases. We establish new benchmarks and representative baselines for OZ-TAL on THUMOS14 and ActivityNet-1.3, and extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches under both offline and online zero-shot settings.
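
To make the task setting concrete, the following is a minimal sketch of a naive online zero-shot baseline. It is not the authors' framework, which the abstract does not specify. It is closest to the simple per-frame aggregation paradigm mentioned above: each incoming frame embedding is matched against text embeddings of the candidate class names by cosine similarity, consecutive frames with the same label are grouped, and an instance is emitted as soon as the run ends. The names `Detection`, `cosine`, and `localize_online`, the `threshold` parameter, and the assumption that frame and text embeddings were already produced by some VLM are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Optional, Sequence


@dataclass
class Detection:
    label: str    # predicted action class name
    start: int    # first frame index of the instance (inclusive)
    end: int      # last frame index of the instance (inclusive)
    score: float  # mean per-frame similarity over the instance


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def localize_online(
    frames: Iterable[Sequence[float]],
    class_embeddings: Dict[str, Sequence[float]],
    threshold: float = 0.5,
) -> Iterator[Detection]:
    """Yield each action instance as soon as its run of frames ends.

    `frames` is a stream of (assumed precomputed) frame embeddings and
    `class_embeddings` maps class names to (assumed precomputed) text
    embeddings. Frames whose best similarity is below `threshold` are
    treated as background.
    """
    current: Optional[str] = None  # label of the run in progress
    start = 0
    scores: List[float] = []
    last = -1
    for t, frame in enumerate(frames):
        last = t
        # Zero-shot per-frame label: the best-matching class text embedding.
        label, score = max(
            ((name, cosine(frame, emb)) for name, emb in class_embeddings.items()),
            key=lambda pair: pair[1],
        )
        frame_label = label if score >= threshold else None
        if frame_label != current:
            # The previous run just ended, so emit it immediately.
            if current is not None:
                yield Detection(current, start, t - 1, sum(scores) / len(scores))
            current, start, scores = frame_label, t, []
        if frame_label is not None:
            scores.append(score)
    # Flush a run that is still open when the stream ends.
    if current is not None:
        yield Detection(current, start, last, sum(scores) / len(scores))
```

Because the function is a generator over a frame stream, a caller receives each instance without waiting for the whole video, which is the "online" part of the setting. Supporting a new action only requires adding its text embedding to `class_embeddings`, which is the "zero-shot" part. A baseline this simple is sensitive to noisy per-frame scores, which is one reason the abstract points to instance-level understanding and to mechanisms for improving VLM representations.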