Question-Aware Evidence Ledgers for Video Relational Reasoning

2026-06-01 · Computer Vision and Pattern Recognition

AI summary

The authors built a system that answers questions about videos by reasoning over relationships between objects and events across time rather than relying on a single frame. They pair a GPT-5.5 video QA solver with question-aware evidence ledgers that make details such as counts, spatial relations, event endpoints, viewpoint, and dialogue explicit. A conservative gate changes the initial answer only when independent evidence uniquely supports a different option. The pipeline reaches 92.95% overall accuracy and 93.79% macro accuracy on the VRR-QA challenge test split.

Visual relational reasoning, Video question answering (QA), GPT-5.5, Spatial relations, Event boundaries, Dialogue context, Open-vocabulary detection, Scene graph, ASR (Automatic Speech Recognition)
Authors
Yilin Ou, Mengshi Qi, Huadong Ma
Abstract
The VRR-QA challenge evaluates visual relational reasoning in videos, where answers often depend on implicit spatial relations, event boundaries, target identity, and dialogue context rather than a single salient frame. We present a test-time reasoning pipeline built around a strong GPT-5.5 video QA solver and a set of question-aware evidence ledgers. The initial solver answers each question from a uniform video representation, while routed ledgers are prompted to make the required targets, count units, reference frames, and temporal or spatial scope explicit for counting, spatial, endpoint, viewpoint, and dialogue reasoning. External tools such as open-vocabulary detection, depth cues, pair crops, ASR, and scene-graph ledgers are used only as evidence sources. A conservative gate keeps the current answer unless independent evidence uniquely supports a different option. The final evidence-gated pipeline achieves 92.95% overall accuracy and 93.79% macro accuracy on the challenge test split.
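The conservative gate described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Evidence` record, the `gate_answer` function, the source labels, and the `min_sources` threshold are all hypothetical. The sketch assumes that "independent" means evidence from distinct sources and that "uniquely supports" means exactly one option, different from the current answer, has any support at all.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """One ledger entry (hypothetical structure, not from the paper)."""
    source: str  # assumed label for the tool that produced it, e.g. "asr"
    option: str  # answer option this evidence supports, e.g. "B"


def gate_answer(current: str, evidence: list[Evidence], min_sources: int = 2) -> str:
    """Keep `current` unless independent evidence uniquely supports another option.

    Assumptions: independence is approximated by counting distinct sources,
    and uniqueness means no other option (including `current`) has support.
    """
    # Group the distinct sources behind each supported option.
    sources_by_option: dict[str, set[str]] = {}
    for item in evidence:
        sources_by_option.setdefault(item.option, set()).add(item.source)

    # No evidence, or evidence split across options: do not switch.
    if len(sources_by_option) != 1:
        return current

    option, sources = next(iter(sources_by_option.items()))
    # Switch only to a different option backed by enough distinct sources.
    if option != current and len(sources) >= min_sources:
        return option
    return current


if __name__ == "__main__":
    agreeing = [Evidence("detector", "B"), Evidence("asr", "B")]
    conflicting = agreeing + [Evidence("depth", "C")]
    print(gate_answer("A", agreeing))     # two distinct sources agree on B
    print(gate_answer("A", conflicting))  # support is split between B and C
```

Under these assumptions the gate is biased toward the initial solver's answer: conflicting, single-source, or absent evidence all leave the answer unchanged, which matches the "keep unless uniquely supported" behavior the abstract describes.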