3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding
2026-03-24 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition • Artificial Intelligence
AI summary
The authors created 3DCity-LLM, a new system that helps computers understand and analyze large 3D city environments using language. Their method looks at objects, relationships between objects, and the entire scene all at once to get a better understanding. They also built a big dataset with 1.2 million examples to train and test the system on different city-related tasks. Their tests show that 3DCity-LLM works better than previous methods at understanding complex urban scenes. They also provide their code and dataset for others to use.
3D vision-language models, multi-modality, coarse-to-fine encoding, object-centric analysis, scene understanding, spatial reasoning, urban intelligence, dataset creation, semantic evaluation, large language models (LLMs)
Authors
Yiping Chen, Jinpeng Li, Wenyu Ke, Yang Luo, Jie Ouyang, Zhongjie He, Li Liu, Hongchao Fan, Hao Wu
Abstract
While multi-modality large language models excel in object-centric or indoor scenarios, scaling them to 3D city-scale environments remains a formidable challenge. To bridge this gap, we propose 3DCity-LLM, a unified framework designed for 3D city-scale vision-language perception and understanding. 3DCity-LLM employs a coarse-to-fine feature encoding strategy comprising three parallel branches for the target object, inter-object relationships, and the global scene. To facilitate large-scale training, we introduce the 3DCity-LLM-1.2M dataset, which comprises approximately 1.2 million high-quality samples across seven representative task categories, ranging from fine-grained object analysis to multi-faceted scene planning. This strictly quality-controlled dataset integrates explicit 3D numerical information and diverse user-oriented simulations, enriching the question-answering diversity and realism of urban scenarios. Furthermore, we apply a multi-dimensional evaluation protocol based on text-similarity metrics and LLM-based semantic assessment to ensure faithful and comprehensive evaluation of all methods. Extensive experiments on two benchmarks demonstrate that 3DCity-LLM significantly outperforms existing state-of-the-art methods, offering a promising and meaningful direction for advancing spatial reasoning and urban intelligence. The source code and dataset are available at https://github.com/SYSU-3DSTAILab/3D-City-LLM.
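The abstract's multi-dimensional evaluation protocol pairs surface-level text-similarity metrics with LLM-based semantic assessment. The minimal sketch below illustrates that idea only in spirit: it is not the authors' actual protocol. It uses a simple token-overlap F1 as a stand-in for text-similarity metrics (BLEU/ROUGE-style), and `combined_score`, `llm_judge_score`, and the equal weights are hypothetical names and choices for illustration.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    # Token-overlap F1: a crude stand-in for text-similarity
    # metrics such as BLEU or ROUGE (illustration only).
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def combined_score(prediction: str, reference: str,
                   llm_judge_score: float,
                   w_text: float = 0.5, w_sem: float = 0.5) -> float:
    # Blend the surface-form metric with an LLM-based semantic
    # score in [0, 1]; the weights here are hypothetical.
    return w_text * token_f1(prediction, reference) + w_sem * llm_judge_score
```

In practice the `llm_judge_score` slot would be filled by prompting an LLM to grade the answer's semantic fidelity, which catches paraphrases that token overlap misses.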