Beyond 2D Matching: A Unified Single-Stage Framework for Geometry-Aware Cross-View Object Geo-Localization
2026-06-29 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary
The authors address cross-view object geo-localization: locating an object seen in a ground or drone image within a geo-tagged satellite image. They introduce a large dataset of more than 220,000 ground-satellite and drone-satellite building image pairs, annotated with point, box, and mask prompts and with camera poses. They also propose GAGeo, a single-stage model built on the 3D foundation model $π^3$ that combines 3D geometric priors with visual features to predict bounding boxes, segmentation masks, and camera poses in one forward pass. A contrastive loss that uses the satellite view as a shared anchor aligns ground and drone representations, enabling zero-shot ground-to-drone localization without triplet training data. The authors report that the approach outperforms existing methods and generalizes to unseen scenes and new cross-view setups.
cross-view object localization, geo-tagged imagery, 3D foundation model, multi-modal prompts, camera pose estimation, contrastive loss, zero-shot learning, bounding box prediction, segmentation masks, drone imagery
Authors
Liyao Wang, Ruipu Wu, Haojun Xu, Lei Shi, Linjiang Huang, Si Liu
Abstract
Cross-view object geo-localization (CVOGL) aims to locate a target object from a query view (e.g., ground or drone) within a geo-tagged reference image (e.g., satellite). Existing approaches rely heavily on 2D appearance matching and are constrained by limited datasets that lack geometric metadata, diverse prompts, and standard field-of-view imagery. To address these intertwined challenges, we first introduce \dataset, a large-scale, high-fidelity building dataset comprising over 220,000 ground-satellite and drone-satellite pairs. It provides multi-modal prompts (points, boxes, masks) and camera poses to enable flexible target referring and explicit spatial modeling. Furthermore, we propose a novel single-stage Geometry-Aware Geo-localization framework (GAGeo), built upon the permutation-equivariant 3D foundation model $π^3$. By integrating visual features, referring prompts, and learnable task tokens, our model adapts the inherited 3D prior to jointly predict bounding boxes, segmentation masks, and camera poses in a single forward pass. Additionally, we introduce a contrastive loss that uses the satellite view as a universal anchor, implicitly aligning ground and drone representations to enable zero-shot ground-to-drone localization without requiring triplet training data. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods and generalizes well to unseen scenes and novel cross-view setups.
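The satellite-anchored contrastive idea can be illustrated with a minimal NumPy sketch. The abstract does not give the loss formulation, so this assumes a standard InfoNCE objective applied separately to ground-satellite and drone-satellite batches; the function names, the temperature value, and the use of in-batch negatives are all hypothetical choices for illustration, not the authors' implementation. The point it shows is that ground and drone embeddings are never compared directly, so no ground-drone-satellite triplets are needed.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def info_nce(query, anchor, temperature=0.07):
    """InfoNCE loss: row i of `query` is the positive for row i of `anchor`.

    All other rows of `anchor` in the batch act as negatives.
    Hypothetical helper; the temperature is an assumed value.
    """
    q = l2_normalize(query)
    a = l2_normalize(anchor)
    logits = (q @ a.T) / temperature                      # (B, B) scaled similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))             # mean over positive pairs


def satellite_anchored_loss(ground, sat_for_ground, drone, sat_for_drone,
                            temperature=0.07):
    """Sum of two pairwise losses that share the satellite embedding space.

    The ground and drone batches may come from different scenes; each is
    pulled only toward its own paired satellite embeddings.
    """
    return (info_nce(ground, sat_for_ground, temperature)
            + info_nce(drone, sat_for_drone, temperature))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sat_g = rng.normal(size=(8, 16))                       # satellite embeddings, ground batch
    sat_d = rng.normal(size=(8, 16))                       # satellite embeddings, drone batch
    ground = sat_g + 0.1 * rng.normal(size=sat_g.shape)    # roughly aligned ground view
    drone = sat_d + 0.1 * rng.normal(size=sat_d.shape)     # roughly aligned drone view
    print(satellite_anchored_loss(ground, sat_g, drone, sat_d))
```

Because both terms use the same satellite space as the target, minimizing them places ground and drone embeddings of the same location near the same satellite embedding, which is one plausible reading of how the alignment becomes implicit.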