GeoMix: Descriptor-Free Visual Localization via Global Context and Multi-Detector Training

2026-07-02 • Computer Vision and Pattern Recognition

AI summary

The authors study a way to match images to 3D scenes without using visual descriptors, which are normally large and hard to manage. They find that previous methods relying only on geometry don’t use local and global spatial cues well and tend to depend on one keypoint detector. To fix this, the authors introduce GeoMix, which improves local and global geometry understanding and can learn from multiple keypoint detectors at once. Their experiments show that GeoMix performs much better than earlier descriptor-free methods, getting closer to traditional descriptor-based accuracy.

visual localization, descriptor-free matching, keypoint detectors, geometry-only matching, 2D-3D matching, spatial embedding, cross-attention, multi-detector training, rotation error, translation error

Authors

Yejun Zhang, Xinjue Wang, Zihan Wang, Esa Rahtu, Juho Kannala

Abstract

Descriptor-free visual localization eliminates high-dimensional descriptor storage, preserves scene privacy, and simplifies map maintenance, yet its accuracy still lags far behind that of descriptor-based pipelines. We attribute this gap to insufficient geometric discriminability in geometry-only matching. Without visual appearance, current methods underutilize local geometric cues, lack global context among keypoints, and overfit to a single keypoint detector. We further observe that descriptor-free matching naturally enables multi-detector training, since heterogeneous keypoints can be optimized in a shared geometry-only space without aligning descriptor spaces. Building on these insights, we propose GeoMix, a descriptor-free 2D-3D matching framework that strengthens geometric discriminability at three levels. Locally, directional and distance-aware embeddings enrich neighborhood aggregation with fine-grained spatial structure. Globally, learnable context nodes aggregate and redistribute scene-wide information via cross-attention to resolve ambiguities beyond local receptive fields. At the training level, Mix-Training exploits this detector-agnostic geometry space to learn representations across multiple keypoint detectors. Extensive experiments on MegaDepth, Cambridge Landmarks, 7Scenes, and Aachen Day-Night show that GeoMix sets a new state of the art among descriptor-free methods, reducing 75th-percentile rotation error by 89% and translation error by up to 90% over the previous best, while generalizing zero-shot to unseen detectors and narrowing the gap to descriptor-based pipelines. Code is available at https://github.com/YejunZhang/Geomix.
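To make the global step more concrete, the following is a minimal sketch of how a small set of context nodes could aggregate and then redistribute scene-wide information through cross-attention. It is an illustration under stated assumptions, not the GeoMix implementation: the function names (`cross_attention`, `global_context_update`), the feature width of 64, the 8 context nodes, the residual connections, and the absence of learned projections are all hypothetical choices made here for brevity. In a trained model the context nodes would be learned parameters; here they are random.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention.

    Assumption: no learned query/key/value projections, to keep the sketch short.
    """
    dim = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(dim)  # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ keys_values                     # (n_queries, dim)


def global_context_update(keypoint_feats, context_nodes):
    """Aggregate keypoint features into context nodes, then redistribute them."""
    # Step 1 (aggregate): every context node attends over all keypoints.
    updated_nodes = context_nodes + cross_attention(context_nodes, keypoint_feats)
    # Step 2 (redistribute): every keypoint attends over the updated nodes.
    updated_feats = keypoint_feats + cross_attention(keypoint_feats, updated_nodes)
    return updated_feats, updated_nodes


rng = np.random.default_rng(0)
keypoint_feats = rng.standard_normal((500, 64))  # 500 keypoints, 64-dim geometric features
context_nodes = rng.standard_normal((8, 64))     # 8 context nodes (random stand-ins)

new_feats, new_nodes = global_context_update(keypoint_feats, context_nodes)
print(new_feats.shape, new_nodes.shape)
```

In this sketch, routing information through M context nodes costs on the order of N·M attention scores for N keypoints, rather than the N² scores that full self-attention among keypoints would need, while still letting every keypoint receive a summary of the whole scene.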
