HDSL: A Hierarchical Domain-Specific Language for Structured 3D Indoor Scene Generation and Localized Editing with LLM Agents

2026-06-08Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition
AI summary

The authors created a special language called HDSL that helps computers build and edit 3D indoor scenes based on text instructions. Unlike older methods that miss details or are hard to change, their approach organizes scene parts like rooms and objects into a tree structure, making it easier to plan and update. They use language models to create parts of the scene, check for mistakes, and fix layout problems. Their method improves how well scenes match the text and saves time, especially when making edits focused on small parts of the scene.

LLM (Large Language Model)scene graph3D indoor scene generationdomain-specific languageXML/CSSmultimodal asset retrievalforce-directed layout optimizationlocal program repairHierarchical Retrieval-Augmented Generation (HRAG)
Authors
Letian Li, Chao Shen, Shuzhao Xie, Chenghao Gu, ZhengXiao He, Yu Meng, Xin Yang, Wenyuan Jiang, Zhi Wang
Abstract
Text-driven indoor scene generation and editing require an intermediate representation that language models can both produce and revise. Existing LLM-based systems often rely on scene graphs or global constraint lists, which are compact but underspecify local geometry and make instruction-based edits difficult to localize. We frame this problem as structured program generation and local program repair, and propose Hierarchical Descriptive Scene Language (HDSL), an XML/CSS-style domain-specific language for structured 3D indoor scenes. HDSL represents rooms, regions, objects, and support surfaces as a tree with local coordinates, making complex scenes easier to plan recursively and easier to retrieve for editing. Our pipeline uses LLM agents to generate HDSL subtrees with bounded verification, grounds non-virtual nodes through multimodal asset retrieval, and applies force-directed layout optimization to repair boundary and collision errors. For editing, Hierarchical Retrieval-Augmented Generation retrieves the relevant subtree, asks the LLM to rewrite only that local context, and merges the result back through a deterministic three-way merge. In our reproduced benchmark, HDSL improves average object coverage, text-scene alignment, and generation time over full text-to-scene baselines while remaining competitive with recent layout-only reproductions on geometry metrics; for editing, HRAG reduces token use by $5.22\times$ and runtime by $6.19\times$, produces valid DSL for all eight paired edits, and better preserves unrelated scene objects.