CrossMaps: Confidence-Aware Open-Vocabulary Semantic Mapping for Rover Navigation

2026-06-15 • Robotics

Robotics, Artificial Intelligence, Machine Learning

AI summary

The authors developed CrossMaps, a system that lets rovers map and interpret their surroundings in real time from camera and depth (RGB-D) data. The method combines language-based image understanding (CLIP embeddings) with a two-part memory: a short-term memory that collects recent, noisy observations and a long-term memory that keeps only the observations judged reliable. The resulting maps can be searched with natural language, which helps the rover navigate even when sensor data is noisy or incomplete. The system is designed to run on real rover hardware alongside SLAM.

semantic mapping, RGB-D data, CLIP embeddings, confidence-aware fusion, Short-Term Memory, Long-Term Memory, SLAM, robot navigation, Jetson Orin, open-vocabulary

Authors

Jan-Niklas Klein, Sona Ghahremani, Christian Medeiros Adriano, Holger Giese

Abstract

Rovers rely on perception to maintain spatial maps that encode both objects and sensor quality (e.g., range reliability, lighting artifacts, data density), guiding data fusion, embedding updates, and navigation under partial observability. To study these coupled perception-navigation processes, we present CrossMaps, a real-time confidence-aware open-vocabulary semantic mapping pipeline that constructs language-queryable maps from RGB-D data. Building on VLMaps-style approaches, CrossMaps integrates multi-scale CLIP embeddings with confidence-aware fusion and a dual-memory architecture consisting of Short-Term Memory (STM) and Long-Term Memory (LTM). The STM aggregates noisy visual observations using geometric, semantic, and temporal confidence cues, while confident and coherent cells are promoted to the LTM as persistent semantic landmarks. Designed for deployment with a Jetson Orin-powered UGV alongside SLAM, CrossMaps runs in real time and produces semantic heatmaps that can be queried with natural language to guide rover navigation.
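
The dual-memory idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the fusion rule, the promotion criteria, or the query mechanism, so everything below is an assumption made for illustration. In particular, the class name DualMemoryMap, the product of the three confidence cues as a fusion weight, the running cosine agreement used as a "coherence" score, the two promotion thresholds, and the cosine-similarity heatmap are all hypothetical. Embeddings are plain NumPy vectors standing in for CLIP features.

```python
import numpy as np


class DualMemoryMap:
    """Toy 2D grid with a short-term (STM) and a long-term (LTM) embedding layer.

    Hypothetical sketch: fusion weights, coherence score, and thresholds
    are illustrative assumptions, not taken from the paper.
    """

    def __init__(self, height, width, dim, promote_conf=2.0, promote_coherence=0.9):
        self.stm_emb = np.zeros((height, width, dim))  # confidence-weighted mean embedding
        self.stm_conf = np.zeros((height, width))      # accumulated confidence per cell
        self.stm_coh = np.zeros((height, width))       # weighted mean agreement of observations
        self.ltm_emb = np.zeros((height, width, dim))  # promoted, persistent embeddings
        self.ltm_mask = np.zeros((height, width), dtype=bool)
        self.promote_conf = promote_conf
        self.promote_coherence = promote_coherence

    def observe(self, row, col, embedding, geometric, semantic, temporal):
        """Fuse one observation into the STM cell at (row, col)."""
        e = np.asarray(embedding, dtype=float)
        e = e / np.linalg.norm(e)
        w = geometric * semantic * temporal  # assumed: cues in [0, 1], combined by product
        if w <= 0.0:
            return
        conf = self.stm_conf[row, col]
        prev = self.stm_emb[row, col]
        prev_norm = np.linalg.norm(prev)
        # Agreement of the new observation with the current cell estimate
        # (defined as 1.0 for the first observation of a cell).
        agreement = 1.0 if prev_norm == 0.0 else float(prev @ e) / prev_norm
        total = conf + w
        self.stm_coh[row, col] = (self.stm_coh[row, col] * conf + w * agreement) / total
        self.stm_emb[row, col] = (prev * conf + w * e) / total
        self.stm_conf[row, col] = total

    def promote(self):
        """Copy confident and coherent STM cells into the LTM; return how many qualified."""
        ready = (self.stm_conf >= self.promote_conf) & (self.stm_coh >= self.promote_coherence)
        self.ltm_emb[ready] = self.stm_emb[ready]
        self.ltm_mask |= ready
        return int(ready.sum())

    def query(self, text_embedding):
        """Return a cosine-similarity heatmap over LTM cells (0 where nothing is stored)."""
        t = np.asarray(text_embedding, dtype=float)
        t = t / np.linalg.norm(t)
        norms = np.linalg.norm(self.ltm_emb, axis=-1)
        safe = np.where(norms > 0.0, norms, 1.0)  # avoid division by zero in empty cells
        heat = (self.ltm_emb @ t) / safe
        return np.where(self.ltm_mask, heat, 0.0)
```

In this sketch, a cell that is seen repeatedly with consistent embeddings accumulates confidence and is promoted, while a cell whose observations disagree stays in the STM. In a real pipeline the vector passed to query would come from a text encoder, and the peak of the returned heatmap could serve as a navigation goal.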
