Q-GeoMem: Question-Guided Geometric Memory for Video Spatial Reasoning

2026-05-26 • Computer Vision and Pattern Recognition

AI summary

The authors developed a system called Q-GeoMem to help computers better understand spatial relationships in videos based on the question being asked. Instead of remembering everything, the system keeps two types of memory: one for recent detailed information and one for compact long-range evidence. It scores each video frame by how relevant it is to the question and how novel it is relative to what is already stored, and uses that score to decide which frames to retain. Tests on VSI-Bench and VSTI-Bench showed that this approach reasons about space in videos more accurately than the other evaluated models.

video spatial reasoning, geometric memory, camera-conditioned geometry, visual tokens, long-range context, Q-Former, memory update, spatial video-language models, evidence scoring, feature representation

Authors

Xianqiang Gao, Qizhi Chen, Delin Qu, Haoming Song, Zhigang Wang, Bin Zhao, Dong Wang, Xuelong Li

Abstract

Video spatial reasoning requires accumulating viewpoint-dependent evidence over time while retaining information useful to the question being asked. Existing spatial video-language models improve geometric perception and long-range context modeling, but often treat memory as a generic temporal cache, which can introduce redundant or irrelevant geometry and weaken long-horizon reasoning. We propose Q-GeoMem, a question-guided geometric memory framework for video spatial reasoning. Q-GeoMem injects camera-conditioned geometry into visual tokens and maintains two complementary memories: a Fine-Grained Context Bank for recent dense features and camera states, and a Semantic-Geometric Evidence Bank for compact long-range evidence. Each candidate frame is scored by the product of Q-Former-based question relevance and novelty with respect to the retained bank; this score is stored and reused during reading, while a capacity-based replacement rule keeps the bank compact. During reasoning, both memories are read before update and adaptively fused with the current frame representation. Experiments on VSI-Bench and VSTI-Bench show that Q-GeoMem achieves state-of-the-art performance among evaluated spatial reasoning models, validating the effectiveness of question-guided geometric memory. Ablations further verify the contribution of the proposed evidence scoring mechanism.
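
The evidence scoring and capacity-based replacement described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions, because the abstract does not specify them: novelty is taken to be one minus the maximum cosine similarity to the retained features, the replacement rule evicts the lowest-scoring entry, and the Q-Former-based relevance is replaced by a plain number supplied by the caller. The names `EvidenceBank`, `cosine`, `novelty`, and `update` are hypothetical.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class EvidenceBank:
    """Hypothetical fixed-capacity bank holding (score, feature) pairs."""
    capacity: int
    entries: list = field(default_factory=list)

    def novelty(self, feat):
        # Assumption: novelty = 1 - max cosine similarity to retained features.
        if not self.entries:
            return 1.0
        max_sim = max(cosine(feat, stored) for _, stored in self.entries)
        return 1.0 - max(0.0, max_sim)

    def update(self, feat, relevance):
        # Score is the product of question relevance and novelty.
        # `relevance` stands in for the Q-Former-based relevance.
        score = relevance * self.novelty(feat)
        if len(self.entries) < self.capacity:
            # Room left: store the feature together with its score.
            self.entries.append((score, feat))
            return score
        # Assumption: when full, replace the lowest-scoring entry,
        # and only if the candidate scores strictly higher.
        worst = min(range(len(self.entries)), key=lambda i: self.entries[i][0])
        if score > self.entries[worst][0]:
            self.entries[worst] = (score, feat)
        return score
```

In this sketch the score is kept alongside each feature, so a reader could reuse it as a weight when retrieving evidence, in the spirit of the stored-and-reused score in the abstract. A frame that duplicates something already in the bank gets zero novelty and is the first to be evicted when a relevant, novel frame arrives.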
