From Pixels to Concepts: Growing Rich 3D Semantic Scene Graph Forests utilizing Foundation Models

2026-06-22 Robotics

AI summary

The authors studied how robots can better understand their environments by using detailed 3D maps that include not just objects but also how they relate to each other in meaningful ways. They used advanced AI models to identify and reason about both specific objects and more abstract concepts, creating layers of connected information called hierarchical 3D scene graph forests. Their method was evaluated on indoor datasets and on a real robot, showing that it can support tasks such as open-vocabulary object retrieval. This work helps robots gain a deeper understanding of their surroundings beyond just fixed categories.

3D scene graphs, hierarchical models, foundation models, vision-language models (VLM), large language models (LLM), semantic relationships, robotics perception, scene understanding, ScanNet dataset, Boston Dynamics Spot
Authors
David Oberacker, Meike Deitersen, Niklas Spielbauer, Tristan Schnell, Georg Heppner, Arne Roennau
Abstract
Operating in complex real-world environments requires robots to understand their surroundings on a functional semantic level. This demands a detailed multi-layer world model that captures the complex relations within those surroundings. Hierarchical 3D scene graphs address this challenge by integrating geometric, semantic, and relational data within a unified spatial framework. However, current 3D scene graph approaches often restrict themselves to rigid structures of pre-determined relationship classes, mostly neglecting important semantic connections such as causal connections or environmental contexts. This paper explores the potential of foundation models to build forests of 3D scene graphs with open semantic relationships to improve scene understanding and robotic task execution. We propose a method where instance-specific concept-nodes and relationships are first identified by a VLM and then extended by an LLM, which infers broader, more abstract concept-nodes and relationships through reasoning. These object-nodes, concept-nodes, and relationships are then assembled into a forest of hierarchical 3D scene graphs, enhanced with concept-nodes to represent abstract concepts. Evaluations were conducted on the uHumans2 and ScanNet indoor datasets, validating the accuracy and relevance of the generated relationships. Downstream suitability of scene-graph forests for robotics applications is demonstrated in an open-vocabulary object-retrieval task using both ScanNet data and a real-world indoor deployment on a Boston Dynamics Spot. This paper leverages foundation models to create more expressive, semantically deep hierarchical 3D scene graphs and demonstrates their potential to advance semantic and environmental understanding in robotics.
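
To make the structure described in the abstract more concrete, the following is a minimal Python sketch of a scene-graph forest with object-nodes, concept-nodes, and free-text relationships, plus a simple concept-based object lookup. It is an illustration only and not the authors' implementation: the class and method names (`SceneGraphForest`, `relate`, `objects_under`), the example nodes, and the relation labels are all assumptions, and the VLM and LLM outputs are replaced by hand-written entries.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    kind: str  # "object" or "concept" (hypothetical labels for this sketch)


@dataclass
class SceneGraphForest:
    nodes: dict[str, Node] = field(default_factory=dict)
    # Directed edges with open, free-text labels: (source, relation, target).
    edges: list[tuple[str, str, str]] = field(default_factory=list)

    def add_node(self, name: str, kind: str) -> None:
        self.nodes[name] = Node(name, kind)

    def relate(self, source: str, relation: str, target: str) -> None:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before relating them")
        self.edges.append((source, relation, target))

    def objects_under(self, concept: str) -> set[str]:
        """Return object-nodes with a directed path leading to `concept`."""
        found: set[str] = set()
        seen = {concept}
        frontier = [concept]
        while frontier:
            current = frontier.pop()
            for source, _, target in self.edges:
                if target == current and source not in seen:
                    seen.add(source)
                    frontier.append(source)
                    if self.nodes[source].kind == "object":
                        found.add(source)
        return found


forest = SceneGraphForest()

# Stand-in for the VLM stage: instance-level nodes and relationships.
for obj in ("mug_1", "kettle_1", "sofa_1"):
    forest.add_node(obj, "object")
forest.add_node("hot drinks", "concept")
forest.add_node("relaxing", "concept")
forest.relate("mug_1", "used for", "hot drinks")
forest.relate("kettle_1", "heats water for", "hot drinks")
forest.relate("sofa_1", "used for", "relaxing")

# Stand-in for the LLM stage: a broader, more abstract concept.
forest.add_node("kitchen activities", "concept")
forest.relate("hot drinks", "part of", "kitchen activities")

print(sorted(forest.objects_under("kitchen activities")))
# prints ['kettle_1', 'mug_1']
```

Because the relation labels are plain strings rather than a fixed class set, the sketch mirrors the idea of open semantic relationships; the graph has several roots ("kitchen activities" and "relaxing"), which is why it is treated as a forest rather than a single tree.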