Make Geometry Matter for Spatial Reasoning

2026-03-27 | Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors explain that current vision-language models (VLMs), which understand images and videos, struggle with spatial reasoning because they rely mostly on 2D visual cues, even when geometry tokens from pretrained 3D foundation models are fused in. To address this, the authors propose GeoSR, a framework with two components: masking a portion of the 2D vision tokens during training so the model must consult the geometry tokens, and a gated routing mechanism that amplifies geometry tokens in regions where geometric evidence is critical. The approach targets spatial reasoning in both static scenes and dynamic videos. The authors report that GeoSR outperforms prior methods on both kinds of spatial reasoning benchmarks.

vision-language models, spatial reasoning, 3D geometry tokens, masking, token fusion, gated routing, static scenes, dynamic videos, fine-tuning, computer vision
Authors
Shihua Zhang, Qiuhong Shen, Shizun Wang, Tianbo Pan, Xinchao Wang
Abstract
Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent advances try to handle this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that naive token fusion followed by standard fine-tuning in this line of work often leaves such geometric cues underutilized for spatial reasoning, as VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry-Unleashing Masking, which strategically masks portions of 2D vision tokens during training to weaken non-geometric shortcuts and force the model to consult geometry tokens for spatial reasoning; and (2) Geometry-Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions in regions where geometric evidence is critical. Together, these designs unleash the potential of geometry tokens for spatial reasoning tasks. Extensive experiments on both static and dynamic spatial reasoning benchmarks demonstrate that GeoSR consistently outperforms prior methods and establishes new state-of-the-art performance by effectively leveraging geometric information. The project page is available at https://suhzhang.github.io/GeoSR/.
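To make the two components more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not specify how tokens are selected for masking or how the gate is parameterized, so everything here is an assumption for illustration: the function names `mask_vision_tokens` and `gated_fusion` are hypothetical, masking is uniform random and replaces tokens with zeros, and the gate is a single per-token sigmoid over the concatenated vision and geometry features.

```python
import math
import random


def mask_vision_tokens(vision_tokens, mask_ratio, rng):
    """Zero out a random subset of 2D vision tokens (training-time only).

    Assumption: uniform random selection and zero replacement; the paper
    describes its masking as "strategic", which this sketch does not model.
    """
    n = len(vision_tokens)
    num_masked = int(round(mask_ratio * n))
    masked_idx = set(rng.sample(range(n), num_masked))
    masked = [
        [0.0] * len(tok) if i in masked_idx else list(tok)
        for i, tok in enumerate(vision_tokens)
    ]
    return masked, masked_idx


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(vision_tokens, geometry_tokens, gate_weights, gate_bias):
    """Fuse each vision token with its geometry token through a scalar gate.

    Assumption: gate = sigmoid(w . [vision; geometry] + b), and
    fused = vision + gate * geometry. A larger gate means the geometry
    token contributes more at that position.
    """
    fused, gates = [], []
    for vis, geo in zip(vision_tokens, geometry_tokens):
        features = list(vis) + list(geo)
        score = sum(w * x for w, x in zip(gate_weights, features)) + gate_bias
        gate = sigmoid(score)
        fused.append([v + gate * g for v, g in zip(vis, geo)])
        gates.append(gate)
    return fused, gates


if __name__ == "__main__":
    rng = random.Random(0)
    vision = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    geometry = [[0.5, 0.5] for _ in vision]
    masked, idx = mask_vision_tokens(vision, mask_ratio=0.5, rng=rng)
    # Untrained gate (all-zero weights) gives every token a gate of 0.5.
    fused, gates = gated_fusion(masked, geometry, [0.0] * 4, 0.0)
    print("masked positions:", sorted(idx))
    print("gates:", gates)
    print("fused:", fused)
```

In this toy setup, a masked position carries no 2D information, so its fused token comes entirely from the gated geometry token. That is the intuition behind weakening 2D shortcuts during training; in a real model the gate weights would be learned and the masking would be disabled at inference time.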