Vision-Language Model Reasoning for Contextual Semantic Mapping in Intralogistics

2026-06-23 • Robotics

AI summary

The authors developed a system that helps robots understand their surroundings by combining geometric maps with semantic information about objects, such as whether they can be moved. It uses SLAM for mapping, SAM for segmenting object instances, instance clustering, and vision-language models (VLMs) that reason about object properties from multiple views without task-specific training. The system reaches 98.93% mIoU for semantic classification and 89.17% mAcc for movability estimation, which supports context-aware navigation in changing warehouse environments. A component-wise analysis shows that VLM reasoning is the main bottleneck for contextual understanding, while instance clustering is the main limitation for panoptic performance.

Autonomous mobile robots, Intralogistics, SLAM, Instance segmentation, Vision-language models (VLM), Contextual semantic mapping, Zero-shot learning, Object movability, Multi-view reasoning, Panoptic segmentation

Authors

Marvin Rüdt, Hao Pang, Constantin Enke, Zäzilia Seibold, Kai Furmans

Abstract

Autonomous mobile robots operating in intralogistics environments rely on geometric maps for localization and navigation, but lack semantic understanding of objects and their contextual properties. We present a contextual semantic mapping pipeline that combines SLAM-based geometric mapping, SAM-based instance segmentation, instance clustering, and VLM multi-view reasoning to produce a contextual semantic map representation encoding geometric structure, object class, and object movability. By aggregating observations across multiple viewpoints and querying a VLM in a zero-shot, open-vocabulary setting, the pipeline infers contextual object properties (here demonstrated through movability) without requiring task-specific training or predefined object categories. We evaluate three VLMs under two prompting strategies and conduct a component-wise analysis of the pipeline. The proposed pipeline achieves 98.93% mIoU for semantic classification and 89.17% mAcc for object movability estimation. Component analysis identifies VLM reasoning as the primary bottleneck for contextual understanding and instance clustering as the main limitation for panoptic performance. The resulting semantic map supports context-aware filtering and robust navigation in dynamic intralogistics environments.
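
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how mIoU and mAcc are commonly computed from per-element labels. It assumes the standard definitions (per-class intersection over union, and per-class accuracy, each averaged over classes); the abstract does not state the exact evaluation protocol, so this may differ from the authors' setup. The function names and the example labels are hypothetical.

```python
def mean_iou(y_true, y_pred):
    """Mean intersection-over-union, averaged over every class that
    appears in either the ground truth or the predictions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    ious = []
    for c in set(y_true) | set(y_pred):
        inter = sum(1 for t, p in pairs if t == c and p == c)
        union = sum(1 for t, p in pairs if t == c or p == c)
        ious.append(inter / union)  # union >= 1 because c occurs somewhere
    return sum(ious) / len(ious)


def mean_class_accuracy(y_true, y_pred):
    """Per-class accuracy (recall), averaged over ground-truth classes."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    accs = []
    for c in set(y_true):
        total = sum(1 for t in y_true if t == c)
        correct = sum(1 for t, p in pairs if t == c and p == c)
        accs.append(correct / total)
    return sum(accs) / len(accs)


# Hypothetical labels, for illustration only (not data from the paper).
class_true = ["pallet", "pallet", "rack", "rack"]
class_pred = ["pallet", "rack", "rack", "rack"]
miou = mean_iou(class_true, class_pred)
macc = mean_class_accuracy(class_true, class_pred)
```

Averaging per class rather than over all elements keeps a frequent class from dominating the score, which matters when a warehouse scene contains many instances of one object type and few of another.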
