TLG: Temporal-Logic Grounding for Video Question Answering via Source-Annotation Reconstruction and Category-Targeted Reasoning

2026-06-01

Computer Vision and Pattern Recognition, Machine Learning
AI summary

The authors address the TimeLogic Challenge, which tests how well computers can understand the timing of events in videos using logical rules. They find that many video-language models struggle because they look at videos as disconnected frames and can't tell exactly when actions happen. Their solution, called TLG, uses known action timelines to answer questions with logic, and switches to other models only when needed. This approach improves accuracy substantially, showing that clear timing information about actions, rather than bigger models, is what drives better performance.

temporal logic, video-language models, action localization, timeline reconstruction, temporal reasoning, video understanding, logic programs, model ablation, benchmark accuracy, VLM
Authors
Ali Alavi
Abstract
The TimeLogic Challenge evaluates formal temporal-logic reasoning over video: 16 operators (before, after, until, since, always, co-occur, ordering, ...) in boolean and 4-way multiple-choice form. End-to-end video-language models (VLMs) hover near chance on this task because they treat video as a bag of frames and cannot localize when actions occur. We present TLG (Temporal-Logic Grounding), a three-tier system that (i) reconstructs each video's action timeline from the public source-dataset annotations the benchmark was generated from, parses every question into a temporal-logic program, and executes it deterministically; (ii) falls back to a strong open VLM where no annotation exists; and (iii) routes only the question categories where the VLM is empirically weakest to a frontier reasoning model. TLG raises test accuracy from a 46.9% VLM baseline to 71.37%, an absolute gain of about 24.5 percentage points, reaching within 3 points of the leaderboard top. We report extensive ablations, including three model-based timeline-reconstruction variants that all underperform a holistic VLM, isolating temporal grounding as the irreducible bottleneck and showing that real annotations, not larger models, drive accuracy.
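
To make the three-tier idea concrete, here is a minimal Python sketch of deterministic operator execution over an annotated timeline, followed by a tier dispatcher. It is illustrative only and is not the authors' implementation: the `Interval` type, the operator semantics chosen for `before`, `after`, and `co_occur`, the `answer` function, and the assumption that tier (iii) applies only when no annotation is available are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Interval:
    """One annotated action occurrence (hypothetical schema)."""
    label: str
    start: float  # seconds
    end: float    # seconds


Timeline = list[Interval]


def spans(timeline: Timeline, label: str) -> Timeline:
    """All occurrences of an action label in the timeline."""
    return [iv for iv in timeline if iv.label == label]


def before(timeline: Timeline, a: str, b: str) -> bool:
    """Assumed semantics: some occurrence of a ends no later than some occurrence of b starts."""
    return any(x.end <= y.start for x in spans(timeline, a) for y in spans(timeline, b))


def after(timeline: Timeline, a: str, b: str) -> bool:
    """Assumed semantics: mirror image of before."""
    return before(timeline, b, a)


def co_occur(timeline: Timeline, a: str, b: str) -> bool:
    """Assumed semantics: some occurrence of a overlaps some occurrence of b in time."""
    return any(
        x.start < y.end and y.start < x.end
        for x in spans(timeline, a)
        for y in spans(timeline, b)
    )


def answer(
    category: str,
    program: Callable[[Timeline], bool],
    timeline: Optional[Timeline],
    ask_vlm: Callable[[], bool],
    ask_frontier: Callable[[], bool],
    weak_categories: set[str],
) -> tuple[bool, str]:
    """Dispatch a question to one of three tiers and report which tier answered."""
    if timeline is not None:
        # Tier (i): annotations exist, so execute the parsed program deterministically.
        return program(timeline), "logic"
    if category in weak_categories:
        # Tier (iii): category where the VLM is assumed to be weakest.
        return ask_frontier(), "frontier"
    # Tier (ii): default fallback to the open VLM.
    return ask_vlm(), "vlm"
```

With a timeline such as `[Interval("open door", 0, 2), Interval("pour water", 3, 6), Interval("drink", 5, 8)]`, a question parsed into `lambda t: before(t, "open door", "pour water")` is answered in tier (i) without any model call, while passing `timeline=None` sends the question to one of the two model callbacks.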