Learning Stable Canonical Worlds for Novel View Synthesis and Beyond
2026-06-22 • Computer Vision and Pattern Recognition
AI summary
The authors propose CanonicalGS, a method that improves how computers create new views of a scene from multiple images. Instead of treating each input image separately and risking confusion from noisy or repeated information, their approach combines all the input into one clear, stable 3D representation of the scene. This makes it easier to generate better-quality new views and helps with tasks like identifying objects in the scene. Experiments show up to a 2.5 dB improvement in peak signal-to-noise ratio for new views and an 11% gain in semantic segmentation accuracy.
Feed-forward Gaussian splatting • Novel view synthesis • Scene representation • Multi-view aggregation • Uncertainty-aware fusion • Semantic segmentation • Peak signal-to-noise ratio • Depth features • Latent world representation
Authors
Xiaoyu Xu, Jian Zou, Sheyang Tang, Zhihua Wang, Jing Liao, Kede Ma
Abstract
Feed-forward Gaussian splatting (FFGS) facilitates real-time novel view synthesis, yet current methods often remain tied to view-dependent predictions. As more input views are added, they may accumulate noisy or redundant evidence instead of converging to a stable scene representation. In this paper, we introduce CanonicalGS, a feed-forward pipeline that maps cluttered multi-view observations into a stable, scene-centric representation. CanonicalGS first extracts view-centric evidence from depth, semantic features, and uncertainty estimates, and then aggregates this evidence in a canonical latent world using uncertainty-aware fusion. By emphasizing reliable observations while suppressing uncertain or redundant ones, CanonicalGS produces representations that scale more effectively for novel view synthesis and transfer to downstream visual perception tasks. Experiments show up to a $2.5$ dB improvement in peak signal-to-noise ratio for synthesizing novel views and an $11\%$ gain in semantic segmentation accuracy.
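The abstract does not spell out how the uncertainty-aware fusion is computed. As a rough illustration of the general idea only, the sketch below uses classical inverse-variance weighting, which is an assumption swapped in here and not necessarily the rule CanonicalGS uses. The function name `fuse_inverse_variance`, the array shapes, and the sample depth values are all hypothetical.

```python
import numpy as np

def fuse_inverse_variance(values, variances, eps=1e-8):
    """Fuse per-view estimates of shape (V, N) into one estimate of shape (N,).

    Assumption: each view reports a value and a variance per scene point,
    and views are combined with inverse-variance weights. This is a generic
    technique, not the paper's confirmed fusion rule.
    """
    values = np.asarray(values, dtype=np.float64)
    variances = np.asarray(variances, dtype=np.float64)
    weights = 1.0 / (variances + eps)        # confident views get larger weights
    weight_sum = weights.sum(axis=0)
    fused = (weights * values).sum(axis=0) / weight_sum
    fused_variance = 1.0 / weight_sum        # decreases as more views contribute
    return fused, fused_variance

# Hypothetical data: three views observing the same two scene points (depth in metres).
# The third view is a high-uncertainty outlier for the first point.
depths = np.array([[2.0, 5.0],
                   [2.2, 5.1],
                   [4.0, 4.9]])
variances = np.array([[0.01, 0.04],
                      [0.01, 0.04],
                      [1.00, 0.04]])

fused, fused_var = fuse_inverse_variance(depths, variances)
naive_mean = depths.mean(axis=0)
```

In this toy setting the fused depth for the first point stays close to the two confident views, whereas the plain average in `naive_mean` is pulled toward the uncertain outlier. The fused variance also falls below every individual view's variance, which is one simple way a representation can stabilize as views are added instead of accumulating noise.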