UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding

2026-04-09 · Computer Vision and Pattern Recognition

AI summary

The authors created UniversalVTG, a single model that can find moments in videos across many different datasets without needing large amounts of compute. A component called the Query Unifier rewrites the varied text queries these datasets use into one common declarative style, which helps the model learn from all of them at once. As a result, the model handles long videos and different query types with a single checkpoint. Despite being far smaller than recent large multimodal models, UniversalVTG performs as well as or better than them on many benchmarks.

video temporal grounding, multimodal language models, cross-dataset pretraining, Query Unifier, long-video analysis, benchmark datasets, model efficiency, dataset generalization
Authors
Joungbin An, Agrim Jain, Kristen Grauman
Abstract
Video temporal grounding (VTG) is typically tackled with dataset-specific models that transfer poorly across domains and query styles. Recent efforts to overcome this limitation have adapted large multimodal language models (MLLMs) to VTG, but their high compute cost and limited video context still hinder long-video grounding. We instead scale unified supervision while keeping the model lightweight. We present UniversalVTG, a single VTG model trained with large-scale cross-dataset pretraining. An offline Query Unifier canonicalizes heterogeneous query formats into a shared declarative space, reducing linguistic mismatch and preventing the negative transfer observed under naïve joint training. Combined with an efficient grounding head, UniversalVTG scales to long, untrimmed videos. Across diverse benchmarks (GoalStep-StepGrounding, Ego4D-NLQ, TACoS, Charades-STA, and ActivityNet-Captions), one UniversalVTG checkpoint achieves state-of-the-art performance versus dedicated VTG models. Moreover, despite being $>100\times$ smaller than recent MLLM-based approaches, UniversalVTG matches or exceeds their accuracy on multiple benchmarks, offering a practical alternative to parameter-heavy MLLMs.
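To make the Query Unifier's role concrete: VTG datasets phrase queries very differently (Ego4D-NLQ uses first-person questions, instructional datasets use step descriptions, captioning datasets use declarative sentences), and canonicalization maps all of them into one declarative form before training. The abstract does not describe how the Unifier is implemented (it is an offline component, plausibly model-based), so the rule-based rewrites below are purely a hypothetical toy sketch of the idea, not the paper's method:

```python
def canonicalize(query: str) -> str:
    """Toy illustration: rewrite heterogeneous query styles into
    one declarative form. The rules here are invented stand-ins
    for the paper's (unspecified) offline Query Unifier."""
    q = query.strip().rstrip("?.!").lower()
    if q.startswith("where did i "):
        # Ego4D-NLQ-style egocentric question -> declarative
        return f"The camera wearer {q[len('where did i '):]}."
    if q.startswith("how to "):
        # Instructional step-style query -> declarative
        return f"A person demonstrates how to {q[len('how to '):]}."
    # Caption-style queries are already declarative; normalize only.
    return q.capitalize() + "."

print(canonicalize("Where did I put the knife?"))
# -> "The camera wearer put the knife."
print(canonicalize("person opens the door"))
# -> "Person opens the door."
```

The point of such a mapping is that downstream the grounding model sees a single linguistic distribution, which (per the abstract) avoids the negative transfer that arises when stylistically mismatched datasets are jointly trained on raw.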