From USD Scenes to Knowledge Graphs: Zero-Shot Ontology Grounding with LLMs

2026-06-08 • Robotics

Robotics, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Graphics

AI summary

The authors explore using large language models (LLMs) to automatically match objects in 3D scenes to formal ontology classes, instead of relying on fragile, manually curated dictionaries. They test this on a 125-object kitchen scene and find that LLMs identify the correct class for 90-96% of objects when names are descriptive, with lower accuracy for abbreviated names. The models rely mainly on context from the scene's organization (sibling names and parent paths), and accuracy drops sharply when that information is anonymized. Overall, the study shows LLMs can help connect 3D objects to known categories without extra training.

knowledge graph, 3D simulation, large language model, Universal Scene Description (USD), ontology, zero-shot learning, scene graph, semantic grounding, SOMA-HOME, prompting

Authors

Jiangtao Shuai, Zongxiong Chen, Manfred Hauswirth, Sonja Schimmler

Abstract

Constructing knowledge graphs from 3D simulation scenes is essential for robot task reasoning, but the key bottleneck, grounding scene objects to formal ontology classes, still relies on manually curated dictionaries that are brittle and do not generalize across assets. We investigate whether large language models (LLMs) can automate this grounding step for Universal Scene Description (USD) scenes as a zero-shot, training-free alternative. On a kitchen scene (125 objects) with the SOMA-HOME Ontology, LLMs achieve 90-96% exact-match accuracy with descriptive names and 49-89% with abbreviated names, substantially outperforming dictionary and embedding baselines. Under fully opaque names, context-augmented prompting recovers up to 48%. Feature ablation reveals that LLMs primarily exploit semantic cues in the scene graph (sibling names and parent paths); anonymizing these cues reduces accuracy to 0-6%, while geometry alone yields only 4-17%.
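
To make the idea of context-augmented prompting concrete, the following is a minimal sketch of how sibling names and a parent path might be gathered from scene-graph paths and placed into a zero-shot prompt, together with an exact-match accuracy helper. It is an illustration under stated assumptions, not the authors' implementation: the function names (`sibling_index`, `build_prompt`, `exact_match_accuracy`), the prompt wording, the example prim paths, and the candidate class names are all hypothetical, and plain path strings stand in for a real USD stage.

```python
from collections import defaultdict
from posixpath import basename, dirname


def sibling_index(prim_paths):
    """Group prim names by their parent path (hypothetical helper)."""
    children = defaultdict(list)
    for path in prim_paths:
        children[dirname(path)].append(basename(path))
    return children


def build_prompt(path, prim_paths, ontology_classes):
    """Assemble a zero-shot prompt from an object's name, parent path, and siblings.

    The wording is an assumption made for illustration only.
    """
    parent, name = dirname(path), basename(path)
    siblings = [s for s in sibling_index(prim_paths)[parent] if s != name]
    return (
        "Assign the scene object to exactly one ontology class.\n"
        f"Object name: {name}\n"
        f"Parent path: {parent}\n"
        f"Sibling names: {', '.join(siblings) or 'none'}\n"
        f"Candidate classes: {', '.join(ontology_classes)}\n"
        "Answer with the class name only."
    )


def exact_match_accuracy(predictions, gold):
    """Fraction of gold objects whose predicted class equals the gold class exactly."""
    hits = sum(predictions.get(path) == cls for path, cls in gold.items())
    return hits / len(gold)


# Hypothetical scene paths; "obj_017" plays the role of an opaque name.
paths = [
    "/Kitchen/Counter/obj_017",
    "/Kitchen/Counter/Kettle",
    "/Kitchen/Counter/Toaster",
    "/Kitchen/Floor/Chair",
]
# Placeholder class names, not taken from any particular ontology.
classes = ["Kettle", "Toaster", "Cup", "Chair"]

prompt = build_prompt("/Kitchen/Counter/obj_017", paths, classes)
print(prompt)
```

The prompt string would then be sent to whichever LLM is under evaluation, and the returned class names collected into a dictionary keyed by prim path for scoring with `exact_match_accuracy`. Anonymizing the sibling names or the parent path in this sketch corresponds to the kind of ablation the abstract describes.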
