ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

2026-05-11 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition • Artificial Intelligence

AI summary

The authors introduce ViSRA, an agent framework that helps multi-modal large language models reason about 3D space in videos without retraining the models. Instead of changing the models themselves, ViSRA supplies explicit spatial information from expert models at inference time. This approach avoids post-training compute and the manual curation of specialized datasets, while improving performance on both existing benchmarks and unseen spatial reasoning tasks. Overall, ViSRA aims to make 3D understanding in these systems more transferable and human-aligned.

Multi-modal Large Language Models • 3D spatial reasoning • Video-based reasoning • Inference-time methods • Post-training • Modular AI frameworks • Transfer learning • Spatial intelligence • Benchmark datasets • Plug-and-play AI

Authors

Tingshu Mou, Jiabo He, Renying Wang, Ce Liu, Hao Yang, Tiehua Zhang, Jingjing Chen, Xingjun Ma

Abstract

Recent advances in Multi-modal Large Language Models (MLLMs) target 3D spatial intelligence, yet progress has been driven largely by post-training on curated benchmarks, leaving inference-time approaches relatively underexplored. In this paper, we take a training-free perspective and introduce ViSRA, a human-aligned Video-based Spatial Reasoning Agent, as a framework to probe the spatial reasoning mechanism of MLLMs. ViSRA elicits spatial reasoning in a modular and extensible manner by leveraging explicit spatial information from expert models, enabling a flexible plug-and-play paradigm. ViSRA offers two key advantages: (1) human-aligned and transferable 3D understanding rather than task-specific overfitting; and (2) no post-training computational cost and no heavy manual curation of spatial reasoning datasets. Experimental results demonstrate consistent improvements across a set of MLLMs on both existing benchmarks and unseen 3D spatial reasoning tasks, with ViSRA outperforming baselines by absolute margins of up to 15.6% and 28.9%, respectively.
