CodecSplat: Ultra-Compact Latent Coding for Feed-Forward 3D Gaussian Splatting

2026-05-25Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition
AI summary

The authors propose CodecSplat, a new method to compress 3D scenes created by a process called feed-forward 3D Gaussian splatting. Instead of compressing the final 3D data, CodecSplat compresses an earlier 2D feature representation, making the compressed files much smaller and more efficient. This approach helps keep the important details in the 3D scenes while reducing storage and transmission needs by about ten times compared to previous methods. Their results show good quality (measured by PSNR) with very compact file sizes on public datasets.

3D Gaussian splattingcompressionfeed-forward networkslatent codingentropy codingPSNRdepth predictionbitstreamrate-distortionmulti-view feature refinement
Authors
Pengpeng Yu, Runqing Jiang, Qi Zhang, Dingquan Li, Jing Wang, Yulan Guo
Abstract
While feed-forward 3D Gaussian splatting reconstructs renderable Gaussian primitives from sparse context views without per-scene optimization, existing pipelines do not provide a compact scene representation for storage or transmission. A natural solution is to apply existing 3DGS compression methods to the generated Gaussian primitives. However, this approach operates on the final irregular 3D representation and is decoupled from the internal feature-to-Gaussian generation process, which limits compression efficiency. To address this, we introduce CodecSplat, an ultra-compact latent coding framework for feed-forward 3D Gaussian splatting. CodecSplat first encodes an intermediate 2D Gaussian-generation feature into an entropy-coded scene bitstream. At the decoder, the latent feature is reconstructed and used to predict depth and Gaussian parameters, which are then mapped to 3D Gaussian primitives. Note that, by integrating compression into the feed-forward Gaussian generation pipeline, CodecSplat avoids inefficient compression over irregular 3D Gaussian primitives and allows the codec to exploit the structured intermediate feature representation. We instantiate CodecSplat on a feed-forward Gaussian splatting backbone with depth-guided multi-view feature refinement and a hierarchical learned feature codec. On DL3DV and RealEstate10K datasets, CodecSplat achieves 23.56-26.36 dB and 24.76-27.05 dB PSNR with only 20.00-107.77 KiB and 3.37-12.51 KiB per scene, respectively. This is roughly one order of magnitude smaller than compressing feed-forward generated Gaussian primitives, while preserving controllable rate-distortion behavior.