SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning

2026-06-08 • Robotics

Robotics, Artificial Intelligence, Computer Vision and Pattern Recognition

AI summary

The authors introduce SpaceVLN, a navigation agent designed to follow language instructions in new environments without needing extra training. Instead of just relying on immediate visual clues, SpaceVLN builds a memory of important places and landmarks as it explores, helping it understand spatial relationships better. This memory allows the agent to plan and reason about navigation tasks step-by-step, improving its ability to find objects or locations. The authors tested SpaceVLN on several benchmarks and real robots, showing it works well without special training for each task.

Vision-and-Language Navigation, Spatial Cognitive Memory, Task-Guided Spatial Reasoning, Spatial Waypoints, Landmarks, Zero-shot Learning, Embodied Navigation, Object-Goal Navigation, Closed-loop Framework, Hierarchical Memory

Authors

Yucheng Deng, Pingrui Lai, Xinhai Li, Chenjia Bai, Xiaoheng Deng, Chengnuo Sun, Xuelong Li, Hua Yang

Abstract

Vision-and-Language Navigation in continuous environments requires agents to understand the spatial structure of previously unseen environments in order to follow language instructions. Although foundation models have opened a promising path toward zero-shot navigation without task-specific policy training, many navigators still rely on local visual cues and linear history-based reasoning, overlooking the spatial nature of navigation across explored regions, traversed paths, landmarks, and their spatial relations. In this paper, we propose SpaceVLN, a navigation agent built around Spatial Cognitive Memory and Task-Guided Spatial Reasoning. Specifically, SpaceVLN introduces an efficient stagewise closed-loop framework where planning and execution are organized around verifiable space-landmark stages. During navigation, the agent progressively abstracts explored regions into Spatial Waypoints and dynamically maintains subtask-grounded landmark evidence, forming a hierarchical Spatial Cognitive Memory for progress localization and spatial-relation understanding. Built on this memory, Spatial-CoT integrates task-progress reasoning with spatial perception, analysis, and prediction, enabling Task-Guided Spatial Reasoning for embodied navigation. The unified stage interface enables SpaceVLN to address both Vision-and-Language Navigation and Object-Goal Navigation under a unified zero-shot setting, without task-specific policy training. Across R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, SpaceVLN achieves state-of-the-art zero-shot performance, and real-robot deployment further validates its applicability. These results highlight Spatial Cognitive Memory and Task-Guided Spatial Reasoning as a practical foundation for stronger embodied navigation agents.
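
As a rough illustration of the two-level memory the abstract describes (waypoints that abstract explored regions, each holding landmark evidence tied to a subtask), the sketch below shows one possible data structure. It is not the authors' implementation: the class names `Landmark`, `Waypoint`, and `SpatialMemory`, the `merge_radius` parameter, the 2D positions, and the distance-based merging rule are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from math import hypot


@dataclass
class Landmark:
    """Hypothetical landmark evidence attached to one instruction stage."""
    name: str      # detected object label, e.g. "sofa"
    subtask: int   # index of the instruction stage this evidence supports


@dataclass
class Waypoint:
    """Hypothetical abstraction of an explored region as a single 2D point."""
    position: tuple[float, float]
    landmarks: list[Landmark] = field(default_factory=list)


class SpatialMemory:
    """Two-level memory: waypoints on top, landmark evidence underneath."""

    def __init__(self, merge_radius: float = 1.5):
        # Assumed rule: observations within merge_radius of an existing
        # waypoint are folded into it instead of creating a new one.
        self.merge_radius = merge_radius
        self.waypoints: list[Waypoint] = []

    def _nearby(self, position):
        """Return the closest waypoint within merge_radius, or None."""
        best, best_dist = None, self.merge_radius
        for wp in self.waypoints:
            dist = hypot(wp.position[0] - position[0],
                         wp.position[1] - position[1])
            if dist <= best_dist:
                best, best_dist = wp, dist
        return best

    def observe(self, position, detections, subtask):
        """Record detections at a position, creating a waypoint if needed."""
        wp = self._nearby(position)
        if wp is None:
            wp = Waypoint(position)
            self.waypoints.append(wp)
        wp.landmarks.extend(Landmark(name, subtask) for name in detections)
        return wp

    def stage_verified(self, subtask, required):
        """A stage counts as verified once all required landmarks are seen."""
        seen = {lm.name for wp in self.waypoints
                for lm in wp.landmarks if lm.subtask == subtask}
        return set(required) <= seen


memory = SpatialMemory(merge_radius=1.5)
memory.observe((0.0, 0.0), ["door"], subtask=0)
memory.observe((0.5, 0.5), ["sofa"], subtask=0)   # merged into first waypoint
memory.observe((5.0, 0.0), ["table"], subtask=1)  # far away: new waypoint
print(len(memory.waypoints), memory.stage_verified(0, {"door", "sofa"}))
```

In this toy version, the `stage_verified` check stands in for the "verifiable" stages mentioned above: a planner could advance to the next stage only when the check passes, and replan otherwise. How SpaceVLN actually verifies stages is not specified in the abstract.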
