TimeLogic Challenge @ CVPR 2026: Strong MLLMs Meet Evidence-Seeking Agents for Temporal-Logic Video Question Answering

2026-06-01

Multimedia
AI summary

The authors study video question answering that requires understanding when actions happen relative to one another, such as before or after, rather than only identifying what appears in the video. They design an agent that actively explores the video, choosing which time points to inspect through a cycle of thinking, acting, and observing. The system routes each question to a category with its own strategy and adapts how it samples the video to the question type and video length. Without additional training, the approach combined with Gemini 3.1 Pro reaches 77.13 AvgAcc on the official TimeLogic test set.

temporal logic, video question answering, active exploration, time-stamped observations, multi-granular sampling, temporal relations, lightweight classifier, policy adaptation, Gemini 3.1 Pro, TimeLogic dataset
Authors
Zhaoyang Xu, Xusheng He, Wei Liu, Zhenyang Li, Jianlong Wu
Abstract
Temporal-logic video question answering requires a model to reason about when actions occur relative to one another, such as before, after, until, since, overlap, and multi-event chains, rather than merely what is present in a video. Standard vision-language models typically answer such questions in a single pass over a fixed, uniformly sampled set of frames, which is poorly matched to evidence that is often localized to narrow action boundaries or dispersed across several distant events. We present an evidence-seeking agent that treats temporal-logic VideoQA as active exploration. The agent follows a Think-Act-Observe loop driven by a multi-granular sampling toolkit, where every observation is interleaved with its absolute timestamp so that temporal relations reduce to numerical comparisons on a shared time axis. Its behavior is shaped by benchmark structure: a lightweight classifier routes each question to a temporal category, each with a tailored policy, iteration depth, and prompt, while sampling budgets adapt to corpus characteristics and clip length. The resulting training-free system couples Gemini 3.1 Pro with a temporal-reasoning policy and achieves 77.13 AvgAcc on the official TimeLogic test set.
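As an illustration of the claim that temporal relations reduce to numerical comparisons on a shared time axis, the following minimal Python sketch shows how a few of the named relations could be checked once each observed action carries absolute timestamps. It is not the authors' implementation: the `Event` type, the function names, the exact inequality conventions, and the example timestamps are all assumptions made for this sketch, and "until" and "since" are left out because the abstract does not define them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical timestamped observation: a label plus start/end in seconds."""
    label: str
    start: float
    end: float


def before(a: Event, b: Event) -> bool:
    # Assumed convention: a ends no later than b starts.
    return a.end <= b.start


def after(a: Event, b: Event) -> bool:
    # "a after b" is defined here as "b before a".
    return before(b, a)


def overlaps(a: Event, b: Event) -> bool:
    # The two intervals share a span of positive length.
    return max(a.start, b.start) < min(a.end, b.end)


def is_chain(events: list[Event]) -> bool:
    # Multi-event chain: every event ends before the next one starts.
    return all(before(x, y) for x, y in zip(events, events[1:]))


# Made-up observations on a shared time axis (seconds from video start).
pour = Event("pour water", 3.0, 7.5)
stir = Event("stir", 6.0, 10.0)
drink = Event("drink", 12.0, 14.0)

print(before(pour, drink))             # True
print(overlaps(pour, stir))            # True
print(is_chain([pour, stir, drink]))   # False, because pour and stir overlap
```

Under these assumptions, answering a "before" or "overlap" question no longer depends on visual judgment once the relevant action boundaries have been located; the remaining difficulty lies in finding those boundaries, which is the part the abstract assigns to the exploration loop.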