Map2World: Segment Map Conditioned Text to 3D World Generation

2026-05-01 · Computer Vision and Pattern Recognition
AI summary

The authors introduce Map2World, a method for generating 3D environments from user-drawn segment maps of arbitrary shape and size, addressing the scale and layout limitations of prior approaches. They also propose a detail enhancer network that adds fine-grained detail to scenes without disrupting overall coherence. The pipeline leverages pretrained asset generators, allowing it to generalize across diverse world types even with limited training data. Experiments show the method outperforms existing approaches in user controllability, object-scale consistency, and scene coherence.

3D world generation, segment maps, scale consistency, detail enhancer network, global structure, asset generators, scene coherence, user-controllability, deep learning, environment simulation
Authors
Jaeyoung Chung, Suyoung Lee, Jianfeng Xiang, Jiaolong Yang, Kyoung Mu Lee
Abstract
3D world generation is essential for applications such as immersive content creation and autonomous driving simulation. Recent advances in 3D world generation have shown promising results; however, these methods are constrained by grid layouts and suffer from inconsistencies in object scale across the world. In this work, we introduce a novel framework, Map2World, which is the first to enable 3D world generation conditioned on user-defined segment maps of arbitrary shapes and scales, ensuring global-scale consistency and flexibility across expansive environments. To further enhance quality, we propose a detail enhancer network that generates the fine details of the world. By incorporating global structure information, the detail enhancer adds fine-grained details without compromising overall scene coherence. We design the entire pipeline to leverage strong priors from asset generators, achieving robust generalization across diverse domains even with limited training data for scene generation. Extensive experiments demonstrate that our method significantly outperforms existing approaches in user-controllability, scale consistency, and content coherence, enabling users to generate 3D worlds under more complex conditions.