EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
2026-05-13 • Computer Vision and Pattern Recognition
AI summary
The authors present EvoGround, a system that learns to find the part of a video matching a text query without using any labeled training data. It has two parts: one that proposes candidate query–moment matches and another that learns to localize them, with each improving the other through repeated practice. Starting from the same base model, the two parts get better over time by sharing feedback. Trained on thousands of unlabeled videos, EvoGround performs as well as or better than methods that rely on human annotations and can also produce detailed video captions without manual labels.
video temporal grounding, natural-language query, self-supervised learning, reinforcement learning, proposer-solver framework, unlabeled data, video captioning, mutual learning, backbone model, video understanding
Authors
Minjoon Jung, Byoung-Tak Zhang, Lorenzo Torresani
Abstract
Video temporal grounding (VTG) takes an untrimmed video and a natural-language query as input and localizes the temporal moment that best matches the query. Existing methods rely on large, task-specific datasets requiring costly manual annotation. We introduce EvoGround, a framework of two coupled self-evolving agents, a proposer and a solver, that learn temporal grounding from raw videos without any human-labeled data. The proposer generates query–moment pairs from raw videos, while the solver learns to ground them and, in return, feeds back signals that improve the proposer. Initialized from the same backbone, the two agents mutually improve across iterations through this self-reinforcing reinforcement-learning loop. Trained on 2.5K unlabeled videos, EvoGround matches or surpasses fully supervised models across multiple VTG benchmarks, while emerging as a state-of-the-art fine-grained video captioner without manual labels.
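To make the described loop concrete, below is a minimal Python sketch of one possible proposer–solver self-evolution cycle, assuming a temporal-IoU reward computed between the proposer's pseudo-labeled moment and the solver's prediction. The class names, method signatures, and reward choice are illustrative assumptions based on the abstract, not the authors' implementation.

```python
# Hypothetical sketch of a proposer-solver self-evolving loop for VTG.
# All classes, methods, and the IoU reward are illustrative assumptions.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Proposal:
    query: str                   # natural-language query generated by the proposer
    moment: Tuple[float, float]  # proposed (start, end) timestamps in seconds


class Proposer:
    """Generates query-moment pairs (pseudo-labels) from a raw, unlabeled video."""
    def propose(self, video) -> List[Proposal]:
        raise NotImplementedError  # e.g., sample segments and describe them

    def update(self, video, proposals: List[Proposal], rewards: List[float]) -> None:
        raise NotImplementedError  # reinforce proposals the solver grounds well


class Solver:
    """Localizes the moment in the video that best matches a query."""
    def ground(self, video, query: str) -> Tuple[float, float]:
        raise NotImplementedError  # predict (start, end) for the query

    def update(self, video, query: str, prediction, target) -> None:
        raise NotImplementedError  # reinforce predictions that overlap the proposal


def temporal_iou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Temporal IoU between two (start, end) intervals, used here as the reward."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def self_evolve(proposer: Proposer, solver: Solver, videos, iterations: int = 3) -> None:
    """One reading of the mutual-improvement loop: the proposer creates pseudo-labels,
    the solver grounds them, and the grounding quality feeds back to both agents."""
    for _ in range(iterations):
        for video in videos:
            proposals = proposer.propose(video)
            rewards = []
            for p in proposals:
                pred = solver.ground(video, p.query)
                reward = temporal_iou(pred, p.moment)
                solver.update(video, p.query, pred, p.moment)
                rewards.append(reward)
            proposer.update(video, proposals, rewards)
```

In this reading, both agents would start from the same backbone model and alternate between generating pseudo-labels and grounding them, with the overlap-based reward serving as the shared feedback signal.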