VertiCue-Bench: Diagnosing Whether MLLMs Use Height Cues to Resolve 2D Ambiguity in Remote Sensing Natural Scenes

2026-05-25, Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Multimedia
AI summary

The authors argue that existing remote sensing benchmarks are mostly 2D-centric, so it is unclear whether current multimodal large language models (MLLMs) can use 3D structure to interpret natural scenes. They introduce VertiCue-Bench, a benchmark that tests whether models can use height information from Canopy Height Models (CHMs) to tell apart regions that look similar in optical imagery but differ in vertical structure. Their evaluation shows that models are beginning to read raw height values from CHMs but largely fail to turn them into reliable semantic reasoning. This reveals a gap between perceiving height and using it to understand natural environments.

Multimodal Large Language Models, Remote Sensing, Canopy Height Model, Geospatial Reasoning, Spectral Confusion, 3D Structure, Semantic Reasoning, Benchmark, Natural Scene Understanding
Authors
Jing Huang, Duanchu Wang, Junjie Yang, Zihang Cheng, Cheng Li, Lin Cui, Zhouyi Wu, Di Wang
Abstract
Multimodal Large Language Models (MLLMs) have recently shown promising progress in geospatial reasoning. However, existing remote sensing benchmarks remain largely 2D-centric, evaluating models primarily on optical appearance. In natural environments, this paradigm breaks down due to severe spectral confusion, where ecologically distinct regions share similar textures but differ fundamentally in vertical structure. In such cases, explicit 3D structural data, such as Canopy Height Models (CHMs), become essential geometric evidence for semantic disambiguation. Yet, it remains unclear whether current MLLMs can genuinely leverage vertical cues to resolve appearance-level ambiguity. To address this gap, we introduce VertiCue-Bench, the first diagnostic benchmark for CHM-grounded geospatial reasoning. VertiCue-Bench comprises 1,534 carefully curated instances across 17 tasks, explicitly disentangling low-level height perception from ambiguity-aware semantic reasoning. Evaluations on 14 state-of-the-art general and remote-sensing-specialized MLLMs, combined with counterfactual modality testing, reveal a striking perception-reasoning dissociation. While models exhibit emerging competence in reading raw CHM height cues, they largely fail to translate geometric perception into reliable semantic reasoning, often underperforming RGB-only baselines when joint constraints are required. Overall, VertiCue-Bench exposes a critical geometry-to-semantics gap in natural scene understanding, offering actionable insights for advancing geospatial MLLMs.
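
The modality comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code. The condition names (`rgb_only`, `rgb_chm`, `rgb_wrong_chm`), the task-group labels (`perception`, `reasoning`), the `Result` record, and the toy data are all hypothetical choices made for illustration; the abstract does not specify how the counterfactual inputs are built or how the 17 tasks are grouped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One model answer on one benchmark instance (hypothetical schema)."""
    task_group: str  # assumed labels: "perception" or "reasoning"
    condition: str   # assumed labels: "rgb_only", "rgb_chm", "rgb_wrong_chm"
    correct: bool


def accuracy(results, task_group, condition):
    """Fraction of correct answers for one task group under one input condition."""
    hits = [r.correct for r in results
            if r.task_group == task_group and r.condition == condition]
    if not hits:
        raise ValueError(f"no results for {task_group!r} under {condition!r}")
    return sum(hits) / len(hits)


def chm_gain(results, task_group):
    """Accuracy change from adding the CHM to the RGB input.

    A negative value means the model does worse with height data
    than with RGB alone.
    """
    return (accuracy(results, task_group, "rgb_chm")
            - accuracy(results, task_group, "rgb_only"))


def counterfactual_drop(results, task_group):
    """Accuracy lost when the true CHM is replaced by a mismatched one.

    A value near zero suggests the answers do not depend on the CHM content.
    """
    return (accuracy(results, task_group, "rgb_chm")
            - accuracy(results, task_group, "rgb_wrong_chm"))


def _rows(task_group, condition, outcomes):
    return [Result(task_group, condition, ok) for ok in outcomes]


# Made-up toy outcomes, NOT results from the paper.
toy = (
    _rows("perception", "rgb_only", [True, False, False, False])
    + _rows("perception", "rgb_chm", [True, True, True, False])
    + _rows("perception", "rgb_wrong_chm", [True, False, False, False])
    + _rows("reasoning", "rgb_only", [True, True, True, False])
    + _rows("reasoning", "rgb_chm", [True, True, False, False])
    + _rows("reasoning", "rgb_wrong_chm", [True, True, False, False])
)

for group in ("perception", "reasoning"):
    print(group, chm_gain(toy, group), counterfactual_drop(toy, group))
```

In this toy data the CHM helps on the perception group but hurts on the reasoning group, and swapping in a mismatched CHM changes nothing for reasoning. That is the kind of pattern a perception-reasoning dissociation would produce, shown here only to make the two quantities concrete.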