Visual Geometry Transformer in the Wild: Distractor-Free 3D Reconstruction

2026-06-22 | Computer Vision and Pattern Recognition

AI summary

The authors noticed that current 3D reconstruction methods work well only when scenes are perfect and free of distracting objects, which is unrealistic. They created a new method called Visual Geometry Transformer in the Wild (VGTW) that can handle messy, real-world images by learning to ignore distractions and focus on consistent parts across different views. Their model uses a special training technique with masks to identify and filter out these distractors, producing clean 3D point clouds without extra 3D data or heavy computation. Tests showed their approach works better and is more reliable in everyday situations.

3D reconstruction, multi-view geometry, transformer, distractor suppression, attention mechanism, point cloud, mask prediction, feature consistency, end-to-end learning, dataset annotation
Authors
Tianbo Pan, Xingyi Yang, Shizun Wang, Xinchao Wang
Abstract
Current end-to-end multi-view 3D reconstruction methods achieve impressive results, but rely on a restrictive static assumption: the scene is entirely distractor-free with perfect cross-view geometry. This reliance on idealized inputs causes even the most advanced methods to fail in real-world settings, where transient distractors and occlusions are present. To address this, we propose Visual Geometry Transformer in the Wild (VGTW), an end-to-end framework for robust reconstruction from inconsistent views. At its core, the framework isolates and suppresses distractor-affected regions while preserving the components that are consistent across views. Specifically, we introduce a Distractor-aware Training (DAT) strategy that separates clean features from distractor-contaminated ones in the attention mechanism while enforcing feature consistency across images. To enable this, we train the model with an auxiliary mask prediction head, using supervision from a new dataset we collected with pixel-level distractor masks. The resulting VGTW model is a feed-forward network that directly outputs clean, distractor-free point clouds. Remarkably, it requires no additional 3D supervision, remains computationally efficient, and is compatible with existing pipelines. Extensive experiments validate our approach, demonstrating state-of-the-art performance and robust generalization in diverse, real-world scenarios.
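The abstract does not say how DAT separates clean from distractor-contaminated features inside attention. One generic way to picture the idea is hard key masking: tokens that a mask head flags as distractors are excluded from the softmax, so they contribute nothing to the aggregated features. The NumPy sketch below illustrates that general technique only and is not the authors' implementation; the function name `distractor_masked_attention`, the `distractor_prob` input, and the 0.5 threshold are all assumptions made for illustration.

```python
import numpy as np


def distractor_masked_attention(q, k, v, distractor_prob, threshold=0.5):
    """Scaled dot-product attention that ignores keys flagged as distractors.

    q: (n_q, d) queries, k: (n_k, d) keys, v: (n_k, d_v) values.
    distractor_prob: (n_k,) per-token distractor probability, standing in for
    the output of a hypothetical mask prediction head (an assumption).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (n_q, n_k) similarity logits
    keep = distractor_prob < threshold               # True for tokens treated as clean
    if not keep.any():
        raise ValueError("all tokens were flagged as distractors")
    # Flagged keys get a logit of -inf, so their softmax weight is exactly zero.
    scores = np.where(keep[None, :], scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Toy usage: 4 query tokens attend over 6 key tokens, two of which are flagged.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(6, 8))
v = rng.normal(size=(6, 5))
prob = np.array([0.1, 0.9, 0.2, 0.8, 0.05, 0.3])
out, w = distractor_masked_attention(q, k, v, prob)
```

In this toy run the weights on key tokens 1 and 3 are zero and each row of `w` still sums to one, so the output is built only from tokens treated as clean. A trained model could equally use soft weighting rather than a hard threshold; the abstract leaves that choice unspecified.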