AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models

2026-06-08 • Artificial Intelligence

AI summary

The authors identify a weakness in large AI models: they struggle to turn what they see from their own viewpoint (egocentric observations) into a global map of where things are relative to one another. They introduce AlloSpatial, a system that helps these models build such maps (called allocentric spatial representations) from direct observations. The approach includes tools for handling noisy or ambiguous visual information and a reward-based training procedure. Tests show that AlloSpatial improves performance on spatial tasks without extra training and, when trained, beats some larger models, suggesting that this method helps AI better understand physical space.

Multimodal Foundation Models, Egocentric Observation, Allocentric Spatial Representation, Cognitive Mapping, Allocentric-Spatial Trees, Spatial Reasoning, Reinforcement Learning, Modality-decoupled Cue Collection, Geometry-Semantic Arbitration

Authors

Shouwei Ruan, Bin Wang, Zhenyu Wu, Qihui Zhu, Yuxiang Zhang, Jingzhi Li, Yubin Wang, Xingxing Wei

Abstract

Multimodal Foundation Models (MFMs) have made substantial progress, yet remain fragile in spatial reasoning over the physical world. A key bottleneck lies in their inability to transform local egocentric observations into a global allocentric spatial representation. To address this, we propose AlloSpatial, an agentic framework for allocentric spatial cognition in foundation models. AlloSpatial introduces World2Mind, a plug-and-play cognitive mapping sandbox that converts egocentric observations into structured allocentric priors, including Allocentric-Spatial Trees and route maps that support querying object topology, geometric relations, passability, and trajectories. To utilize these priors reliably under noisy reconstruction and ambiguous visual evidence, AlloSpatial introduces a Spatial Reasoning Harness for tool-use judgment, modality-decoupled cue collection, and geometry-semantic arbitration. We further internalize this process in Qwen3-VL through cold-start reinforcement learning with a harness-gated trajectory-level reward. Experiments on VSI-Bench and MindCube show that AlloSpatial improves proprietary models by 5%-18% in a training-free setting, while ASTs alone support strong spatial reasoning even when visual inputs are removed. The trained AlloSpatial agents further outperform larger general-purpose models and competitive spatial baselines, suggesting that structured allocentric representations, active tool use, and verifiable reasoning offer a promising route toward spatially capable foundation models.
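
To make the egocentric-to-allocentric distinction concrete, the following is a minimal sketch of the underlying coordinate change in 2D. It is not World2Mind or any part of AlloSpatial; the function name `ego_to_allo`, the pose convention (x, y, yaw), and the (forward, left) offsets are assumptions chosen for illustration only.

```python
import math

def ego_to_allo(pose, point_ego):
    """Map a 2D point from the agent (egocentric) frame to a shared world (allocentric) frame.

    pose: (x, y, yaw) of the agent in the world frame; yaw in radians,
          counterclockwise, with yaw = 0 facing the world +x axis (assumed convention).
    point_ego: (forward, left) offset of the observed object relative to the agent.
    """
    x, y, yaw = pose
    forward, left = point_ego
    # Rotate the egocentric offset by the agent's heading, then translate by its position.
    world_x = x + forward * math.cos(yaw) - left * math.sin(yaw)
    world_y = y + forward * math.sin(yaw) + left * math.cos(yaw)
    return (world_x, world_y)

# Two hypothetical views of the same object from different poses.
view_a = ego_to_allo((0.0, 0.0, 0.0), (2.0, 1.0))      # agent at origin, facing +x
view_b = ego_to_allo((4.0, 1.0, math.pi), (2.0, 0.0))  # agent at (4, 1), facing -x
```

Under these assumptions, both views resolve to approximately the same world position, which is what lets separate egocentric observations be merged into a single allocentric map.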
