ProSR: Process-Shaped Spatial Reasoning for Reliable Chain-of-Thought in VLMs

2026-05-25 | Computer Vision and Pattern Recognition

AI summary

The authors argue that current vision-language models struggle to reason reliably about spatial relationships because their training does not explicitly constrain the reasoning process, so the reasoning is not guaranteed to depend on the image. They build a new chain-of-thought dataset and identify two common failure modes during reinforcement learning: reasoning that bypasses the visual evidence, and uncertainty that rises abnormally in the later steps of reasoning. To address both, they propose ProSR, a method that penalizes reasoning that does not depend on the image and reasoning that drifts in its final stage. Their experiments show that this approach improves answer accuracy and yields reasoning steps that are more stable and more dependent on visual evidence.

vision-language models, spatial reasoning, chain of thought (CoT), reinforcement learning, process degradation, spurious grounding, trajectory stability, counterfactual invariance, optimization framework, out-of-distribution benchmarks
Authors
Jiangyang Li, Cong Wan, Changjie Wu, Songlin Dong, Lingjun Zhang, Linzhe Shi, Xu Wang, Zhiheng Ma, Hang Zhang, Mu Xu, Yihong Gong
Abstract
Reliable spatial reasoning remains a core bottleneck for vision-language models (VLMs). Existing mainstream training paradigms for spatial reasoning largely rely on outcome alignment or process imitation, lacking explicit constraints on the reasoning process, and therefore struggle to ensure genuine visual dependence and stable reasoning trajectories. In this paper, we construct a high-quality CoT dataset covering diverse spatial phenomena and diagnose the model's reasoning process, revealing two typical types of process degradation during reinforcement learning optimization: Spurious Grounding, which bypasses visual evidence, and Tail Instability, where uncertainty abnormally rises in the later stage of reasoning. To address these issues, we propose ProSR, a process-shaping optimization framework for spatial reasoning. Through a Counterfactual Invariance Penalty and a Tail Drift Penalty, ProSR extends the optimization objective from single answer correctness to two process-level dimensions: visual dependence and trajectory stability. Experiments on multiple complex and out-of-distribution spatial reasoning benchmarks show that ProSR improves answer accuracy while generating reasoning trajectories that are more stable and more dependent on visual evidence.
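The abstract names the two penalties but does not give their formulas, so the following is only an illustrative sketch of how an outcome reward could be extended with two process-level terms. Everything in it is an assumption rather than the authors' definition: the function names, the hinge form with a `margin`, the head/tail entropy comparison, the `tail_fraction` split, and the weights `lam_cf` and `lam_tail` are all hypothetical. The inputs are assumed to be per-token log-probabilities of the same reasoning trace scored with the real image and with a perturbed (counterfactual) image, plus per-token entropies of the trace.

```python
from statistics import mean


def counterfactual_invariance_penalty(logp_real, logp_counterfactual, margin=0.5):
    """Hypothetical form: penalize a trace whose likelihood barely changes
    when the image is replaced by a counterfactual one (spurious grounding)."""
    gap = mean(logp_real) - mean(logp_counterfactual)
    # No penalty once the real image makes the trace clearly more likely.
    return max(0.0, margin - gap)


def tail_drift_penalty(token_entropies, tail_fraction=0.25):
    """Hypothetical form: penalize mean entropy in the tail of the trace
    when it exceeds the mean entropy of the earlier tokens."""
    n = len(token_entropies)
    if n < 2:
        return 0.0  # too short to split into head and tail
    split = max(1, min(n - 1, int(n * (1 - tail_fraction))))
    head, tail = token_entropies[:split], token_entropies[split:]
    return max(0.0, mean(tail) - mean(head))


def shaped_reward(correct, logp_real, logp_counterfactual, token_entropies,
                  lam_cf=0.5, lam_tail=0.5):
    """Answer correctness minus the two weighted process penalties.
    The weights lam_cf and lam_tail are illustrative, not from the paper."""
    cip = counterfactual_invariance_penalty(logp_real, logp_counterfactual)
    tdp = tail_drift_penalty(token_entropies)
    return float(correct) - lam_cf * cip - lam_tail * tdp


# Toy traces: one grounded and stable, one image-invariant with a drifting tail.
grounded = shaped_reward(True, [-0.2, -0.3, -0.1], [-1.5, -1.7, -1.6],
                         [0.4, 0.4, 0.4, 0.4])
spurious = shaped_reward(True, [-0.2, -0.3, -0.1], [-0.2, -0.3, -0.1],
                         [0.2, 0.2, 0.2, 1.0])
print(grounded, spurious)
```

In this toy setup both traces reach the correct answer, yet the second receives a lower reward because its likelihood is unchanged under the counterfactual image and its entropy rises at the end. That is the sense in which the objective extends from answer correctness alone to visual dependence and trajectory stability.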