$\text{VG}^2$GT: Voxel-Gaussian Splatting Visual Geometry Grounded Transformer

2026-06-01 • Computer Vision and Pattern Recognition

AI summary

The authors introduce VG²GT, a feed-forward method for 3D scene reconstruction that combines voxel grids with Gaussian splatting on top of a frozen visual foundation model. Instead of predicting pixel-aligned Gaussians, it regresses Gaussian primitive parameters from multi-scale voxel features and supervises depth maps during training through stochastic solid volume rendering, so the visual model never needs to be retrained. According to the authors, this lowers training cost, lets the method plug into any patch-feature-based visual foundation model, and outperforms existing techniques on the DTU, Replica, TAT, and ScanNet datasets.

Gaussian splatting, voxel, visual foundation model, 3D reconstruction, depth maps, solid volume rendering, transformer, novel view synthesis, DTU dataset, ScanNet dataset

Authors

Yibin Zhao, Yihan Pan, Jun Nan, Wenli Yang, Liwei Chen, Jianjun Yi

Abstract

Gaussian splatting has shown strong potential for 3D reconstruction and novel view synthesis. However, most existing methods require accurate camera parameters and per-scene optimization, while feed-forward methods with pixel-aligned Gaussian primitives often suffer from artifacts and non-uniform primitives. In this paper, we propose $\text{VG}^2$GT, a Voxel-Gaussian Splatting Visual Geometry-Grounded Transformer. $\text{VG}^2$GT leverages a frozen pretrained visual foundation model (VFM), incorporates a multi-scale differentiable voxel module to enhance geometric understanding, and directly splits and regresses Gaussian primitive parameters from voxel features. During training, depth maps are supervised through stochastic solid volume rendering, enabling geometrically accurate Gaussian scene reconstruction while keeping the visual foundation model fully frozen. This design allows $\text{VG}^2$GT to be plugged into any patch-feature-based VFM while substantially reducing the required training cost. $\text{VG}^2$GT outperforms current state-of-the-art methods on the widely used DTU, Replica, TAT, and ScanNet datasets.
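The abstract does not spell out how voxel features become Gaussian primitives, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that back-projected points carrying patch features are average-pooled into a single-scale voxel grid, and that a hypothetical linear head then splits each occupied voxel into a few Gaussians. The names `voxelize` and `voxel_to_gaussians`, the five-row weight layout, and the tanh and sigmoid activations are all illustrative assumptions.

```python
import math

def voxelize(points, feats, voxel_size):
    """Average the features of all points that fall into the same voxel."""
    sums, counts = {}, {}
    for p, f in zip(points, feats):
        key = tuple(math.floor(c / voxel_size) for c in p)
        prev = sums.get(key, [0.0] * len(f))
        sums[key] = [s + x for s, x in zip(prev, f)]
        counts[key] = counts.get(key, 0) + 1
    return {k: [s / counts[k] for s in v] for k, v in sums.items()}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def voxel_to_gaussians(key, feat, voxel_size, heads):
    """Split one voxel into len(heads) Gaussians (hypothetical linear head).

    Each head is a 5 x D weight matrix (assumed layout):
    rows 0-2 give the mean offset, row 3 the scale, row 4 the opacity logit.
    """
    center = [(k + 0.5) * voxel_size for k in key]
    gaussians = []
    for w in heads:
        raw = [sum(wi * fi for wi, fi in zip(row, feat)) for row in w]
        # tanh bounds the offset to half a voxel, so the mean stays inside it.
        offset = [0.5 * voxel_size * math.tanh(r) for r in raw[:3]]
        gaussians.append({
            "mean": [c + o for c, o in zip(center, offset)],
            "scale": voxel_size * sigmoid(raw[3]),
            "opacity": sigmoid(raw[4]),
        })
    return gaussians

# Toy input: three 3D points with 2-dimensional features (made-up values).
points = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.1), (1.4, 0.2, 0.9)]
feats = [[1.0, 0.0], [3.0, 2.0], [0.5, 0.5]]
voxels = voxelize(points, feats, voxel_size=1.0)

# Two children per voxel: an all-zero head and a head that shifts along x.
zero_head = [[0.0, 0.0] for _ in range(5)]
shift_head = [[1.0, 0.0]] + [[0.0, 0.0] for _ in range(4)]
gaussians = voxel_to_gaussians((0, 0, 0), voxels[(0, 0, 0)], 1.0,
                               [zero_head, shift_head])
```

Because the number of Gaussians is tied to occupied voxels rather than to input pixels, a scheme like this yields primitives that are spread more evenly in space, which is the motivation the abstract gives for moving away from pixel-aligned prediction. In a trainable version the heads would be learned layers in an autodiff framework, and the pooling would be repeated at several voxel sizes.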
