G-MASt3R-SfM: Graph-based View Pruning and Multi-stage Optimization for Robust SfM

2026-06-22 · Computer Vision and Pattern Recognition

AI summary

The authors improve 3D reconstruction from multiple images by fixing problems with matching points between images. They found that a recent method called MASt3R sometimes matches images that don't overlap, causing errors. To solve this, they created G-MASt3R-SfM, which first removes bad image matches using a graph-based approach and then refines the camera positions step-by-step. Their tests show better accuracy in estimating camera poses and building 3D models, especially by reducing mistakes from incorrect matches.

Structure from Motion, 3D Reconstruction, Image Matching, MASt3R, Correspondence Matching, Camera Pose Estimation, Graph-based View Pruning, Multi-Stage Optimization, Outlier Detection, ETH3D Dataset
Authors
Toshiki Watanabe, Shintaro Ito, Natsuki Takama, Koichi Ito, Takafumi Aoki
Abstract
Structure from Motion (SfM) is essential for multi-view 3D reconstruction; however, its accuracy depends heavily on the accuracy of image matching. While the recent correspondence matching method MASt3R enables robust matching even under challenging conditions, it tends to generate incorrect correspondences for non-overlapping image pairs. Consequently, existing SfM methods built on MASt3R, such as MASt3R-SfM, suffer significant degradation in pose estimation accuracy because they incorporate these unreliable matches directly into optimization. To address this issue, we propose G-MASt3R-SfM, a novel SfM pipeline that enhances robustness through two key modules. First, the Graph-based View Pruning (GVP) module constructs a scene graph from matching confidence and geometrically prunes outlier views. Second, the Multi-Stage Optimization (MSO) module progressively refines camera parameters by expanding the optimization scope from local consistency to global consistency. Experiments on the ETH3D dataset demonstrate that our method achieves state-of-the-art accuracy in both camera pose estimation and 3D reconstruction, effectively suppressing noise caused by outliers.
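
To make the idea of pruning views on a confidence-based scene graph concrete, here is a minimal sketch. It is not the authors' GVP module: the abstract does not say how the graph is built or what geometric criterion is used for pruning. The sketch assumes a symmetric matrix of pairwise matching confidences, a fixed threshold for creating edges, and a simple connectivity rule (keep the largest connected component) in place of the geometric test. The function name `prune_views`, the `threshold` parameter, and the example scores are all hypothetical.

```python
from collections import deque


def prune_views(confidence, threshold=0.5):
    """Return sorted indices of views kept after pruning.

    confidence: square, symmetric list of lists with pairwise matching
    scores (hypothetical input; not the actual MASt3R output format).
    threshold: assumed cutoff above which two views are linked by an edge.
    """
    n = len(confidence)
    # Scene graph: connect views i and j when their confidence is high enough.
    adj = {
        i: [j for j in range(n) if j != i and confidence[i][j] >= threshold]
        for i in range(n)
    }

    seen = set()
    best = []
    for start in range(n):
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        comp = []
        queue = deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        # Keep the largest component; views outside it are treated as outliers.
        if len(comp) > len(best):
            best = comp
    return sorted(best)


# Illustrative scores: views 0-2 overlap strongly, view 3 matches nothing well.
example_confidence = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]
kept_views = prune_views(example_confidence, threshold=0.5)
```

With these made-up scores, view 3 has no edge above the threshold and is dropped before any pose optimization would run. A real pipeline would likely replace the connectivity rule with a geometric consistency check, as the abstract indicates.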