Honey, I Shrunk the Arc de Triomphe!

2026-06-01
Computer Vision and Pattern Recognition

AI summary

The authors found that current 3D depth estimation models struggle to correctly judge distances for faraway objects, often thinking they are closer than they really are. They believe this is because training data mostly comes from limited sources like vehicle-mounted sensors or indoor scans, which don't cover diverse real-world scenes well. To fix this, they created a new diverse dataset called MetricScenes using photos and stereo images from the internet, adding real-world scale information from GPS data and camera setups. By fine-tuning existing models with this dataset, the authors improved the accuracy of measuring distances in varied outdoor scenes without losing performance on standard tests.

monocular geometry estimation, scale collapse, metric scale, depth maps, camera pose estimation, stereo imagery, Poisson completion, geo-tagging, fine-tuning, open-domain scenes
Authors
Yuanbo Xiangli, Hanyu Chen, Xueqing Tsang, Noah Snavely
Abstract
Metric-scale monocular geometry estimation has seen significant progress through large-scale data aggregation, yet current foundation models suffer from a persistent "scale-collapse" phenomenon: distant landmarks and vast landscapes are metrically underestimated. We hypothesize that this performance gap stems from a training data bottleneck, where existing metric-scale datasets are hardware-constrained to homogeneous vehicle-captured LiDAR or short-range indoor scans, or consist of synthetic data that lacks the semantic complexity of the physical world. To bridge this gap, we curate a new metrically grounded, in-the-wild dataset that we call MetricScenes, gathered from a variety of sources including Internet photo collections and stereo imagery. We estimate camera poses and initial depth maps for each scene using off-the-shelf methods, and recover absolute scale from geo-tagged metadata as well as known stereo camera baselines. We also improve the quality of depth maps derived from MetricScenes via a new two-stage Poisson completion method. Fine-tuning MoGe-2 on our dataset significantly mitigates scale-collapse and achieves superior metric accuracy in unconstrained, open-domain scenes while maintaining state-of-the-art performance on standard benchmarks.
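
The abstract mentions two sources of absolute scale: geo-tagged metadata and known stereo baselines. The sketch below illustrates one plausible way each could yield metric units; it is not the authors' pipeline. It assumes the geo-tags have already been converted to local metric coordinates (for example, an east-north-up frame), and the function names `scale_from_geotags` and `depth_from_disparity` are hypothetical. The first function estimates a meters-per-unit factor for an up-to-scale reconstruction as the median ratio of pairwise camera distances; the second applies the standard rectified-stereo relation, depth = focal length × baseline / disparity.

```python
import numpy as np


def scale_from_geotags(sfm_centers, gps_positions_m):
    """Estimate meters per reconstruction unit from matched camera positions.

    sfm_centers:     (N, 3) camera centers in arbitrary reconstruction units.
    gps_positions_m: (N, 3) positions of the same cameras in a local metric
                     frame (assumption: geo-tags were converted beforehand).
    The median of pairwise distance ratios is used so that a few noisy
    geo-tags do not dominate the estimate.
    """
    sfm = np.asarray(sfm_centers, dtype=float)
    gps = np.asarray(gps_positions_m, dtype=float)
    i, j = np.triu_indices(len(sfm), k=1)  # all unordered camera pairs
    d_sfm = np.linalg.norm(sfm[i] - sfm[j], axis=1)
    d_gps = np.linalg.norm(gps[i] - gps[j], axis=1)
    valid = d_sfm > 1e-9  # skip coincident cameras
    return float(np.median(d_gps[valid] / d_sfm[valid]))


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Metric depth for a rectified stereo pair: Z = f * B / d.

    Pixels with non-positive disparity are returned as NaN.
    """
    disparity = np.asarray(disparity_px, dtype=float)
    depth = np.full(disparity.shape, np.nan)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth


if __name__ == "__main__":
    # Toy data: the metric positions are exactly 5x the reconstruction units.
    sfm = [[0, 0, 0], [1, 0, 0], [0, 2, 0]]
    gps = [[0, 0, 0], [5, 0, 0], [0, 10, 0]]
    print(scale_from_geotags(sfm, gps))
    print(depth_from_disparity([10.0, 0.0, 20.0], focal_px=1000.0, baseline_m=0.1))
```

Multiplying an up-to-scale depth map by the factor from `scale_from_geotags` would put it in meters. Real geo-tags are noisy, so a robust estimator with outlier rejection would likely be needed in practice.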