T2LDM++: A Self-Conditioned Representation Guided Diffusion Model for Realistic Text-to-LiDAR Scene Generation
2026-06-29 • Computer Vision and Pattern Recognition
AI summary
The authors address problems in generating 3D LiDAR scenes from text descriptions, where past methods often produced over-smoothed, hard-to-control results because paired Text-LiDAR training data is scarce. They propose a new model called T2LDM++ that improves quality by guiding the denoising network during training so that it better captures object shapes and geometry. They also built two large benchmarks with geometric annotations, along with a controllability metric, to train and evaluate the model. Their method can generate detailed 3D scenes from text and other inputs, and the guidance is used only during training, so it adds no cost at inference. Overall, the authors focus on making LiDAR scene generation more controllable and realistic.
LiDAR, Text-to-Image Generation, Diffusion Models, Denoising Diffusion Probabilistic Models (DDPM), Self-Conditioned Representation Guidance, Scene Reconstruction, Geometric Annotations, 3D Scene Generation, Conditional Generation, Controllability Metrics
Authors
Wentao Qu, Qi Zhang, Chenxu Wang, Guofeng Mei, Yongfei Liu, Xiaoshui Huang, Gim Hee Lee, Liang Xiao
Abstract
Recent progress in Text-to-Image generation benefits from large-scale Text-Image pairs. However, the scarcity of Text-LiDAR pairs often causes over-smoothed scenes and limited controllability. In this paper, we rethink the limitations of the Text-to-LiDAR generation task, focusing on alleviating insufficient training priors and constructing controllable Text-LiDAR data. We propose T2LDM++, a Text-to-LiDAR Diffusion Model for LiDAR scene generation with Self-Conditioned Representation Guidance (SCRG). Specifically, to alleviate object over-smoothing, SCRG employs a Guidance Network (GN) to provide reconstruction-based soft supervision to the Denoising Network (DN). This enables DN to learn geometry-aware representations through reconstruction guidance, leading to more accurate denoising in DDPMs. Through analysis and design, SCRG is made both more effective and lightweight, and it is decoupled from inference, so it adds no computational overhead at sampling time. Furthermore, we construct two high-quality Text-LiDAR benchmarks (>100K samples) using a generalized strategy of geometric annotations, along with a controllability metric. Moreover, a directional position prior is designed to mitigate street distortion, further improving scene fidelity. Additionally, T2LDM++ supports multiple conditions, including (Semantic, Box, BEV, Camera)-to-LiDAR, Sparse-to-Dense, and Dense-to-Sparse generation, by learning a control encoder on top of a frozen DN. With effective prior modeling and high-quality Text-LiDAR benchmarks, T2LDM++ can generate realistic LiDAR scenes with rich geometric details in both unconditional and conditional settings.
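The abstract does not give the SCRG loss, so the following is only a minimal sketch of the general idea it describes: a standard DDPM noising step, plus a denoising loss with an extra term that pulls DN features toward GN features during training. The forward-noising formula is the standard DDPM one. The function names, the linear schedule, the mean-squared alignment term, and the `weight` value are illustrative assumptions, not the authors' formulation.

```python
import math

def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear noise schedule (assumed)."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def q_sample(x0, t, alpha_bar, noise):
    """Standard DDPM forward step: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    a = alpha_bar[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]

def mse(a, b):
    """Mean squared error between two equal-length lists of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def guided_loss(eps_pred, eps_true, dn_feat, gn_feat, weight=0.1):
    """Denoising loss plus a hypothetical feature-alignment (soft supervision) term.

    dn_feat / gn_feat stand in for intermediate features of the Denoising
    Network and the Guidance Network; the alignment term and its weight are
    assumptions made for illustration only.
    """
    denoise = mse(eps_pred, eps_true)
    align = mse(dn_feat, gn_feat)
    return denoise + weight * align, denoise, align

# Toy usage on a flattened 4-value "range image".
alpha_bar = linear_alpha_bar(num_steps=10)
x0 = [1.0, 0.5, -0.5, -1.0]
eps = [0.1, -0.2, 0.3, 0.0]
x_t = q_sample(x0, t=5, alpha_bar=alpha_bar, noise=eps)
total, denoise, align = guided_loss(eps_pred=[0.0] * 4, eps_true=eps,
                                    dn_feat=x_t, gn_feat=x0)
```

Because the alignment term appears only in the training loss, sampling in this sketch would use the denoiser alone, which mirrors the abstract's statement that the guidance is decoupled from inference.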