SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning

2026-05-18 • Computer Vision and Pattern Recognition

Authors

Xiao Yang, Ronghao Fu, Zhiwen Lin, Zhuoran Duan, Jiashun Zhu, Jiasen Hu, Lang Sun, Weipeng Zhang, Jiaqi Liu, Xu Na, Haoran Liu, Weijie Zhang, Bo Yang

Abstract

Remote sensing vision-language models commonly rely on pretrained visual encoders to convert images into semantic features before language-model reasoning. While effective for scene-level understanding, this pipeline may prematurely compress local visual evidence, making fine-grained spatial reasoning vulnerable to language priors, especially in ultra-high-resolution remote sensing imagery. We present SkyNative, a native multimodal framework for remote sensing that adopts an encoder-free architecture, removing the pretrained visual backbone to directly represent images as raw patch tokens in the language-model token space. To reconcile low-level visual patches with textual tokens, SkyNative introduces a modality-aware decoupling mechanism that uses modality-specific parameters within a unified autoregressive backbone. We further introduce a visual reliance benchmark that diagnoses whether models ground their answers in image evidence through progressive visual degradation and misleading textual prompts. Across standard remote sensing understanding tasks and large-format spatial reasoning evaluations, SkyNative shows stronger image-grounded perception and improved robustness against prompt-induced language priors. These results suggest that native patch-level multimodal modeling is a promising direction for reliable remote sensing vision-language reasoning.
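To make the two architectural ideas in the abstract concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes that "raw patch tokens" means flattened, non-overlapping pixel patches passed through a single linear projection, and that "modality-specific parameters" can be illustrated by routing image and text tokens through separate weight matrices inside one shared sequence. The names `patchify`, `modality_routed_linear`, `w_embed`, `w_text`, and `w_image`, as well as all sizes, are hypothetical.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def modality_routed_linear(tokens, is_image, w_text, w_image):
    """Apply a different weight matrix to each token depending on its modality.

    Illustrative assumption only: the abstract does not say which layers
    carry modality-specific parameters.
    """
    out = np.empty((tokens.shape[0], w_text.shape[1]), dtype=tokens.dtype)
    out[is_image] = tokens[is_image] @ w_image
    out[~is_image] = tokens[~is_image] @ w_text
    return out


rng = np.random.default_rng(0)
d_model, patch = 16, 4  # hypothetical sizes

# No pretrained visual encoder: pixels go straight to a linear projection.
image = rng.random((8, 12, 3))
patches = patchify(image, patch)
w_embed = rng.standard_normal((patch * patch * 3, d_model)) * 0.02
image_tokens = patches @ w_embed

# Stand-in for embedded text tokens.
text_tokens = rng.standard_normal((5, d_model))

# One sequence, one backbone, with a mask recording each token's modality.
sequence = np.concatenate([image_tokens, text_tokens])
is_image = np.arange(len(sequence)) < len(image_tokens)

w_text = rng.standard_normal((d_model, d_model))
w_image = rng.standard_normal((d_model, d_model))
hidden = modality_routed_linear(sequence, is_image, w_text, w_image)
```

In this sketch the image and text tokens share one sequence, so a shared attention layer could mix them, while the routed projection keeps separate parameters per modality. Which components are actually decoupled in SkyNative is not stated in the abstract.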
