Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models

2026-06-01 • Computer Vision and Pattern Recognition

AI summary

The authors address the difficulty of enabling vision-language models to understand spatial relationships by treating these models as active participants rather than passive observers. They create a dynamic cognitive map that stores object positions and orientations and acts as persistent memory as new observations arrive. They also introduce Spatial Assertion Codes, Python expressions that describe spatial relationships and can be checked step by step against the map, giving the model denser feedback to learn from. By combining these ideas with supervised and reinforcement finetuning, the authors report 80.5% overall accuracy on the MindCube spatial reasoning benchmark.

vision-language models, spatial reasoning, cognitive maps, reinforcement learning, spatial relationships, dense rewards, supervised finetuning, MindCube benchmark, programmatic reasoning

Authors

Wei Deng, Xianlin Zhang, Mengshi Qi

Abstract

Enabling Vision-Language Models (VLMs) to perform spatial reasoning remains challenging. Existing approaches treat VLMs as passive observers, which limits their usefulness in real-world applications. Moreover, reinforcement learning methods rely on sparse rewards, limiting their effectiveness for complex reasoning tasks. Inspired by the way pigeons build and exploit cognitive maps for navigation, we propose a novel agentic pipeline for spatial reasoning. First, we introduce a dynamic cognitive map that parameterizes the scene layout as object positions and orientations and serves as persistent memory for new observations. Second, we propose Spatial Assertion Codes (SAC), Python expressions that programmatically describe spatial relationships. Working together with the dynamic cognitive map, SAC enables verification of intermediate reasoning steps, providing dense reward signals. We optimize the model via supervised and reinforcement finetuning. Experiments on the MindCube benchmark demonstrate state-of-the-art performance with 80.5% overall accuracy, outperforming the best current method by 29.5 accuracy points (a relative improvement of 53.2%) on the challenging Rotation subset. Our code and data are open-sourced at https://github.com/dw-dengwei/active-spatial-reasoning.git.
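
The abstract does not specify the map format or the SAC syntax, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes a 2D map, headings in degrees measured counterclockwise from the +x axis, and two hypothetical predicates (`left_of`, `in_front_of`). All names, including `MapObject` and `score_assertions`, are invented here. Each assertion is a Python expression checked against the map, and the fraction of assertions that hold is used as a dense reward.

```python
import math
from dataclasses import dataclass


@dataclass
class MapObject:
    x: float
    y: float
    heading_deg: float  # facing direction; 0 = +x axis, counterclockwise (assumed convention)


def bearing_deg(observer: MapObject, target: MapObject) -> float:
    """Angle of target relative to observer's heading, in [-180, 180); positive = left."""
    angle = math.degrees(math.atan2(target.y - observer.y, target.x - observer.x))
    return (angle - observer.heading_deg + 180.0) % 360.0 - 180.0


def left_of(cmap, target: str, observer: str) -> bool:
    """Hypothetical predicate: target lies to the observer's left."""
    return bearing_deg(cmap[observer], cmap[target]) > 0.0


def in_front_of(cmap, target: str, observer: str) -> bool:
    """Hypothetical predicate: target lies in the observer's forward half-plane."""
    return abs(bearing_deg(cmap[observer], cmap[target])) < 90.0


def score_assertions(cmap, assertions) -> float:
    """Dense reward sketch: fraction of assertion expressions that hold on the map."""
    # eval is used only for illustration; the namespace exposes just the two predicates.
    env = {
        "__builtins__": {},
        "left_of": lambda a, b: left_of(cmap, a, b),
        "in_front_of": lambda a, b: in_front_of(cmap, a, b),
    }
    results = []
    for expr in assertions:
        try:
            results.append(bool(eval(expr, env)))
        except Exception:
            results.append(False)  # malformed assertions earn no credit
    return sum(results) / len(results) if results else 0.0


# Hypothetical scene: the agent stands at the origin facing +y.
cognitive_map = {
    "agent": MapObject(0.0, 0.0, 90.0),
    "chair": MapObject(-1.0, 2.0, 0.0),
    "lamp": MapObject(2.0, 1.0, 180.0),
}

assertions = [
    "left_of('chair', 'agent')",     # holds: the chair is ahead and to the left
    "in_front_of('lamp', 'agent')",  # holds: the lamp is ahead and to the right
    "left_of('lamp', 'agent')",      # fails: the lamp is on the right
]
reward = score_assertions(cognitive_map, assertions)
```

Because every intermediate assertion is scored separately, a reasoning trace that gets two of three relations right earns partial credit instead of the all-or-nothing signal of a final-answer reward. Updating an entry in `cognitive_map` (for example, the agent's heading after a turn) changes which assertions hold, which is the sense in which the map acts as persistent, updatable memory in this sketch.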
