Skill-Conditioned Visual Geolocation for Vision-Language

2026-04-10 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence

AI summary

The authors introduce GeoSkill, a training-free system that helps vision-language models work out where a photo was taken by consulting a growing graph of simple, human-readable geographic skills written in natural language. Unlike earlier methods, which rely on knowledge stored implicitly in model weights, GeoSkill improves over time: a larger model runs multiple reasoning rollouts on image-coordinate pairs, and the system adds or prunes skills based on which attempts succeed or fail, all without updating any model parameters. Experiments on GeoRC and on external datasets indicate that GeoSkill locates images accurately, explains its reasoning faithfully, and generalizes well, and that its geographic knowledge improves as the Skill-Graph evolves.

Vision-language models, Image geolocation, Geographic reasoning, Skill-Graph, Autonomous evolution, Inference model, Reasoning rollouts, Geographic bias, Natural language skills, Self-improving systems

Authors

Chenjie Yang, Yutian Jiang, Chenyu Wu

Abstract

Vision-language models (VLMs) have shown a promising ability in image geolocation, but they still lack structured geographic reasoning and the capacity for autonomous self-evolution. Existing methods predominantly rely on implicit parametric memory, which often exploits outdated knowledge and generates hallucinated reasoning. Furthermore, current inference is a "one-off" process, lacking the feedback loops necessary for self-evolution based on reasoning outcomes. To address these issues, we propose GeoSkill, a training-free framework based on an evolving Skill-Graph. We first initialize the graph by refining human expert trajectories into atomic, natural-language skills. For execution, GeoSkill employs an inference model to perform direct reasoning guided by the current Skill-Graph. For continuous growth, an Autonomous Evolution mechanism leverages a larger model to conduct multiple reasoning rollouts on image-coordinate pairs sourced from web-scale data and verified real-world reasoning. By analyzing both successful and failed trajectories from these rollouts, the mechanism iteratively synthesizes and prunes skills, effectively expanding the Skill-Graph and correcting geographic biases without any parameter updates. Experiments demonstrate that GeoSkill achieves promising performance in both geolocation accuracy and reasoning faithfulness on GeoRC, while maintaining superior generalization across diverse external datasets. Furthermore, our autonomous evolution fosters the emergence of novel, verifiable skills, significantly enhancing the system's cognition of real-world geographic knowledge beyond isolated case studies.
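The evolution loop described in the abstract (multiple rollouts per image-coordinate pair, then synthesizing and pruning skills based on success or failure) can be illustrated with a toy sketch. This is not the authors' implementation: the names `SkillGraph`, `evolve`, and `stub_rollout`, the win-rate pruning rule, the co-usage edges, and the 25 km success threshold are all assumptions made for illustration, and the stub stands in for the larger model that would actually perform the reasoning.

```python
import math
from dataclasses import dataclass, field

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


@dataclass
class Skill:
    text: str        # natural-language statement of the skill
    wins: int = 0    # rollouts that used the skill and succeeded
    losses: int = 0  # rollouts that used the skill and failed


@dataclass
class SkillGraph:
    skills: dict = field(default_factory=dict)  # name -> Skill
    edges: set = field(default_factory=set)     # (name, name) co-usage links (assumed)

    def add(self, name, text):
        self.skills.setdefault(name, Skill(text))

    def record(self, used, success):
        for name in used:
            skill = self.skills[name]
            if success:
                skill.wins += 1
            else:
                skill.losses += 1
        # Link skills that appeared together in one trajectory.
        self.edges.update((a, b) for a in used for b in used if a < b)

    def prune(self, min_uses=2, min_win_rate=0.3):
        """Drop skills that were tried often enough and mostly failed (assumed rule)."""
        dropped = {n for n, s in self.skills.items()
                   if s.wins + s.losses >= min_uses
                   and s.wins / (s.wins + s.losses) < min_win_rate}
        for n in dropped:
            del self.skills[n]
        self.edges = {e for e in self.edges if not (set(e) & dropped)}
        return dropped


def evolve(graph, pairs, rollout, n_rollouts=4, threshold_km=25.0):
    """One evolution pass: run rollouts, score them, add proposals, then prune."""
    for image, truth in pairs:
        for i in range(n_rollouts):
            used, pred, proposed = rollout(graph, image, i)
            success = haversine_km(pred, truth) <= threshold_km
            graph.record(used, success)
            if success and proposed is not None:
                graph.add(*proposed)  # skill synthesized from a successful trajectory
    return graph.prune()


def stub_rollout(graph, image, i):
    """Hypothetical stand-in for the larger model; alternates a good and a bad guess."""
    if i % 2 == 0:
        return ["bollards"], (48.86, 2.35), ("signs", "Placeholder: read the script on street signs.")
    return ["sun"], (0.0, 0.0), None


graph = SkillGraph()
graph.add("bollards", "Placeholder: compare roadside bollard styles.")
graph.add("sun", "Placeholder: infer hemisphere from sun position.")
dropped = evolve(graph, [("img_001", (48.8566, 2.3522))], stub_rollout)
print("pruned:", dropped, "remaining:", sorted(graph.skills))
```

No model weights change anywhere in this loop; only the graph of text skills is edited, which is the sense in which such a scheme is training-free. A real system would need a far more careful criterion for synthesizing and pruning skills than the simple win-rate threshold assumed here.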
