AI summary
The authors created a system called RemoteAgent to help Earth Observation (EO) AI interpret vague human requests that call for different levels of visual detail. They built a new dataset, VagueEO, which pairs EO tasks with simulated vague natural-language queries to teach the model how people typically phrase requests. Using this dataset for reinforcement fine-tuning, they aligned a Multi-modal Large Language Model (MLLM) to handle image-level and sparse region-level tasks directly and to call specialized tools only for dense, pixel-wise predictions. This design avoids the computational cost of indiscriminate tool invocation while making use of the MLLM's native capabilities. Experiments show that RemoteAgent recognizes user intent robustly and performs competitively across diverse EO tasks.
Earth Observation, Multi-modal Large Language Models, Natural Language Queries, Reinforcement Fine-tuning, Sparse Region-Level Tasks, Dense Predictions, Agentic Framework, Visual Precision, Model Context Protocol
Authors
Liang Yao, Shengxiang Xu, Fan Liu, Chuanyi Zhang, Bishun Yao, Rui Min, Yongjun Li, Chaoqian Ouyang, Shimin Di, Min-Ling Zhang
Abstract
Earth Observation (EO) systems are essentially designed to support domain experts who often express their requirements through vague natural language rather than precise, machine-friendly instructions. Depending on the specific application scenario, these vague queries can demand vastly different levels of visual precision. Consequently, a practical EO AI system must bridge the gap between ambiguous human queries and the appropriate multi-granularity visual analysis tasks, ranging from holistic image interpretation to fine-grained pixel-wise predictions. While Multi-modal Large Language Models (MLLMs) demonstrate strong semantic understanding, their text-based output format is inherently ill-suited for dense, precision-critical spatial predictions. Existing agentic frameworks address this limitation by delegating tasks to external tools, but indiscriminate tool invocation is computationally inefficient and underutilizes the MLLM's native capabilities. To this end, we propose RemoteAgent, an agentic framework that strategically respects the intrinsic capability boundaries of MLLMs. To empower this framework to understand real user intents, we construct VagueEO, a human-centric instruction dataset pairing EO tasks with simulated vague natural-language queries. By leveraging VagueEO for reinforcement fine-tuning, we align an MLLM into a robust cognitive core that directly resolves image- and sparse region-level tasks. Consequently, RemoteAgent processes suitable tasks internally while intelligently orchestrating specialized tools via the Model Context Protocol exclusively for dense predictions. Extensive experiments demonstrate that RemoteAgent achieves robust intent recognition capabilities while delivering highly competitive performance across diverse EO tasks.
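The routing idea in the abstract (answer image-level and sparse region-level requests inside the MLLM, delegate only dense predictions to a tool) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `Granularity`, `route`, `toy_recognize`, and the `segmenter` tool are illustrative stand-ins. In particular, the keyword matcher merely stands in for the fine-tuned MLLM's intent recognition, and the plain Python callable stands in for a tool that RemoteAgent would invoke through the Model Context Protocol.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class Granularity(Enum):
    """Level of visual precision a query demands (hypothetical labels)."""
    IMAGE = "image"                  # holistic interpretation of the whole image
    SPARSE_REGION = "sparse_region"  # a few regions, expressible as text
    DENSE = "dense"                  # pixel-wise output, ill-suited to text


@dataclass
class Result:
    handled_by: str          # "mllm" or the name of the external tool
    granularity: Granularity
    output: str


def toy_recognize(query: str) -> Granularity:
    """Keyword stand-in for intent recognition (assumption, not the paper's method)."""
    q = query.lower()
    if any(w in q for w in ("outline", "mask", "every pixel")):
        return Granularity.DENSE
    if any(w in q for w in ("where", "locate", "find")):
        return Granularity.SPARSE_REGION
    return Granularity.IMAGE


def route(
    query: str,
    recognize: Callable[[str], Granularity],
    answer_internally: Callable[[str, Granularity], str],
    dense_tools: Dict[str, Callable[[str], str]],
    pick_tool: Callable[[str], str],
) -> Result:
    """Resolve non-dense tasks internally; delegate dense ones to a tool."""
    level = recognize(query)
    if level is not Granularity.DENSE:
        # Image- and sparse region-level tasks stay inside the model.
        return Result("mllm", level, answer_internally(query, level))
    # Dense predictions go to a specialized tool (a plain callable here;
    # a real system would issue this call through its tool protocol).
    name = pick_tool(query)
    return Result(name, level, dense_tools[name](query))


def demo(query: str) -> Result:
    """Wire the router with placeholder components for illustration."""
    tools = {"segmenter": lambda q: "<mask placeholder>"}
    return route(
        query,
        recognize=toy_recognize,
        answer_internally=lambda q, level: f"<text answer at {level.value} level>",
        dense_tools=tools,
        pick_tool=lambda q: "segmenter",
    )
```

Calling `demo("Where are the ships?")` stays inside the placeholder model, while `demo("Outline the flooded fields")` is delegated to the placeholder `segmenter`. The point of the sketch is only the branch on granularity: tool invocation happens on one path rather than for every request.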