EditSSC: Toward Editable Semantic Occupancy Scenes with Unconditional Diffusion Models

2026-06-08 | Computer Vision and Pattern Recognition

AI summary

The authors present EditSSC, a method for generating and editing 3D semantic scenes for autonomous driving. Instead of relying on 3D-specific architectures, they reshape 3D semantic occupancy grids into multi-channel 2D bird's eye view images and reuse an existing 2D latent diffusion model with minimal modifications. The approach supports sketch-guided generation, inpainting, and outpainting without any retraining. On SemanticKITTI, EditSSC outperforms 3D-specific baselines on unconditional generation, indicating that established 2D architectures can be repurposed for 3D scene generation and editing.

3D semantic scene generation, Bird's Eye View (BEV), latent diffusion network, semantic occupancy grids, Stable Diffusion, quantized autoencoder, UNet, SemanticKITTI, sketch-guided generation, inpainting
Authors
Fatima Balde, Raoul de Charette, Alexandre Boulch
Abstract
3D semantic scene generation is crucial for autonomous driving applications, yet most methods rely on complex 3D-specific architectures such as triplane encoders and adapted diffusion networks, limiting both their simplicity and their editing capabilities. We propose EditSSC, an editing-ready method for 3D semantic scene generation using 2D Bird's Eye View (BEV) representations and an off-the-shelf latent diffusion network. Our approach reshapes 3D semantic occupancy grids into multi-channel BEV images and leverages the quantized autoencoder and UNet from Stable Diffusion with minimal modifications. We perform diffusion on the latents after quantization, which enables training-free editing capabilities. By exploiting class-to-code correspondences in the codebook, our method supports sketch-guided generation, inpainting, and outpainting without any retraining. On SemanticKITTI, EditSSC outperforms existing 3D-specific baselines on unconditional generation, demonstrating that well-established 2D architectures can be effectively repurposed for 3D scene generation and editing.
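
The reshaping of a 3D semantic occupancy grid into a multi-channel BEV image can be illustrated with a minimal sketch. The abstract does not specify the exact layout, so the code below assumes one plausible reading: the height axis becomes the channel axis of a 2D image. The function names `grid_to_bev` and `bev_to_grid`, the 256 x 256 x 32 grid size, and the 20-class label range are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def grid_to_bev(grid: np.ndarray) -> np.ndarray:
    """Turn an (X, Y, Z) grid of class labels into a (Z, X, Y) BEV image.

    Assumption: each height slice becomes one image channel. The paper's
    actual encoding may differ (for example, one-hot class channels).
    """
    if grid.ndim != 3:
        raise ValueError("expected a 3D grid with shape (X, Y, Z)")
    return np.ascontiguousarray(np.transpose(grid, (2, 0, 1)))


def bev_to_grid(bev: np.ndarray) -> np.ndarray:
    """Inverse of grid_to_bev: (Z, X, Y) image back to an (X, Y, Z) grid."""
    if bev.ndim != 3:
        raise ValueError("expected a 3D image with shape (Z, X, Y)")
    return np.ascontiguousarray(np.transpose(bev, (1, 2, 0)))


# Hypothetical grid: 256 x 256 cells on the ground plane, 32 height bins,
# labels drawn at random from 20 illustrative classes.
rng = np.random.default_rng(0)
grid = rng.integers(0, 20, size=(256, 256, 32), dtype=np.uint8)

bev = grid_to_bev(grid)          # 32-channel image of size 256 x 256
restored = bev_to_grid(bev)      # lossless round trip back to the 3D grid
```

Because the mapping is a pure axis permutation, it is lossless: any 2D model operating on the BEV image sees the full 3D content, with vertical structure carried in the channel dimension.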