AI summary
The authors address the problem of reconstructing 3D models of urban building facades from close-range video, which is difficult because glass, reflections, repeated windows, and weak texture can yield renderings that look plausible while the underlying surface geometry is unstable. They propose shell-supervised Gaussian Splatting, which uses an external facade structural shell as lightweight geometric supervision during reconstruction. The shell is aligned to the video reconstruction frame and rendered into per-view depth, normal, and valid-mask maps, which are applied through mask-gated losses alongside the usual RGB supervision. Experiments show improved facade orientation and visible-surface point-cloud consistency compared with photo-only, monocular-cue, and surface-oriented Gaussian baselines. Held-out rendering quality remains comparable to those baselines.
embodied AI, video-to-3D reconstruction, Gaussian Splatting, facade shell, geometric supervision, depth rendering, normal maps, point-cloud consistency, novel-view synthesis, mask-gated losses
Authors
Yuan Yang, Peijun Lu, Fangzhou Lu, Sai Fan, Siqi Yan, Chenyuan Zhang, Haobo Liang, Yichen Wang
Abstract
Real-to-sim reconstruction for embodied AI requires geometry that is useful for collision reasoning, navigation, and agent-environment interaction, not only photorealistic novel-view synthesis. However, close-range urban facades are difficult for video-to-3D reconstruction: glass, reflections, repeated windows, and weak texture can produce visually plausible renderings with unstable surface geometry. We introduce shell-supervised Gaussian Splatting, a reconstruction-stage framework that uses an external facade structural shell as lightweight geometric supervision for video-driven Gaussian reconstruction. The method aligns an exterior shell to the video reconstruction frame, renders per-view depth, camera-space normal, and valid-mask maps, and applies these cues through mask-gated losses during Gaussian optimization. This design preserves RGB-driven appearance while regularizing only visible shell-supported facade regions. Experiments on anonymized close-range urban facade scenes show improved facade orientation and visible-surface point-cloud consistency over photo-only, monocular-cue, and surface-oriented Gaussian baselines, while maintaining comparable held-out rendering quality.
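To make the idea of mask-gated supervision concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not state the loss forms, so the L1 depth term, the cosine normal term, and the function name `mask_gated_losses` are all assumptions chosen for illustration. The sketch only shows the gating behaviour described above: pixels outside the shell's valid mask contribute nothing to the geometric terms, so RGB supervision alone governs them.

```python
import numpy as np


def mask_gated_losses(pred_depth, pred_normal, shell_depth, shell_normal,
                      valid_mask, eps=1e-8):
    """Hypothetical mask-gated depth and normal losses for one view.

    pred_depth, shell_depth:   (H, W) depth maps.
    pred_normal, shell_normal: (H, W, 3) camera-space normal maps.
    valid_mask:                (H, W) boolean, True where the shell is visible.
    The L1 and cosine forms are assumptions; the source does not specify them.
    """
    mask = valid_mask.astype(np.float64)
    n_valid = mask.sum()
    if n_valid == 0:
        # No shell-supported pixels in this view: no geometric penalty.
        return 0.0, 0.0

    # Depth term: mean absolute error over shell-supported pixels only.
    depth_loss = (np.abs(pred_depth - shell_depth) * mask).sum() / n_valid

    # Normal term: mean (1 - cosine similarity) over the same pixels.
    pn = pred_normal / (np.linalg.norm(pred_normal, axis=-1, keepdims=True) + eps)
    sn = shell_normal / (np.linalg.norm(shell_normal, axis=-1, keepdims=True) + eps)
    cos = (pn * sn).sum(axis=-1)
    normal_loss = ((1.0 - cos) * mask).sum() / n_valid

    return float(depth_loss), float(normal_loss)
```

In a full training loop these two terms would presumably be added to the photometric loss with small weights (the weights are likewise an assumption), so that appearance remains RGB-driven while only the visible, shell-supported facade regions are regularized.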