SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm

2026-06-18 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence, Databases

AI summary

The authors created a new large dataset that combines very high-resolution radar images, aligned optical photos, and written descriptions from many places around the world. Unlike earlier datasets, theirs keeps the complex-valued radar signal and the radar's native viewing geometry, and warps each optical image onto the radar grid so the two match pixel by pixel. They also provide captions of three lengths for each image to help train AI models that understand both images and language. The dataset, its fixed train/validation/test splits, and the accompanying code are publicly available, so researchers can compare how well models link radar images, optical images, and text.

Synthetic Aperture Radar (SAR), Complex-valued data, Ground Range Detected (GRD), Slant-range grid, Band-limited FFT resampling, Multimodal learning, Optical imagery, Natural language captions, Cross-modal retrieval, Conditional generation

Authors

Solène Debuysère, Nicolas Trouvé, Nathan Letheule, Elise Colin, Georgia Channing

Abstract

Multimodal foundation models have advanced rapidly thanks to large optical benchmarks, but comparable resources for synthetic aperture radar (SAR) remain limited. Existing SAR-optical datasets largely rely on low-resolution, intensity-only Ground Range Detected (GRD) products and preserve neither the complex-valued SAR measurements nor the native acquisition geometry, which restricts physically grounded multimodal learning. In particular, large-scale public datasets combining very-high-resolution (VHR) single-look complex (SLC) SAR data, aligned optical imagery, and natural-language descriptions are still lacking. We present a VHR SAR-optical-text dataset built from open-access Umbra spotlight acquisitions distributed as Sensor Independent Complex Data (SICD). From around 2,500 worldwide scenes (VV/HH, 20 cm to 2 m native resolution), we standardize all SAR data to an 80 cm slant-range grid via band-limited FFT resampling and tile the imagery into 1024 × 1024 patches. For each SAR patch, we retrieve a high-resolution optical tile and warp it into the SAR grid using local coordinate correspondences to obtain pixel-level alignment. We further generate three caption variants (SHORT/MID/LONG) per sample to support vision-language training and evaluation. Our dataset contains 119,566 triplets (complex and amplitude slant-range SAR patch, aligned optical patch, natural-language description) covering 257 locations across 72 countries and a broad range of land types and infrastructures. We release fixed train/validation/test splits and the full preprocessing and baseline code to enable reproducible benchmarks for multimodal alignment, cross-modal retrieval, and conditional generation in native SAR geometry. The dataset is publicly available on the Hugging Face Hub at https://huggingface.co/datasets/ONERA/SARLO-80.
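
To make the resampling and tiling steps concrete, here is a minimal NumPy sketch of band-limited FFT resampling followed by patch extraction. It is an illustration under stated assumptions, not the authors' released preprocessing code: the function names (`fft_resample_axis`, `resample_to_spacing`, `tile`) are hypothetical, the spectrum is assumed to be centered at zero frequency, the pixel spacings are assumed to be known per axis, and tiles are assumed to be non-overlapping with partial edge tiles dropped.

```python
import numpy as np


def fft_resample_axis(x, n_out, axis):
    """Resample a complex array along one axis by cropping or zero-padding
    its centered spectrum (band-limited, sinc-like interpolation)."""
    n_in = x.shape[axis]
    spec = np.fft.fftshift(np.fft.fft(x, axis=axis), axes=axis)

    out_shape = list(x.shape)
    out_shape[axis] = n_out
    out = np.zeros(out_shape, dtype=np.complex128)

    # Copy the shared band, keeping the DC bin (index n // 2 after
    # fftshift) aligned between the input and output spectra.
    n_keep = min(n_in, n_out)
    src = [slice(None)] * x.ndim
    dst = [slice(None)] * x.ndim
    src_start = n_in // 2 - n_keep // 2
    dst_start = n_out // 2 - n_keep // 2
    src[axis] = slice(src_start, src_start + n_keep)
    dst[axis] = slice(dst_start, dst_start + n_keep)
    out[tuple(dst)] = spec[tuple(src)]

    y = np.fft.ifft(np.fft.ifftshift(out, axes=axis), axis=axis)
    # Rescale so sample amplitudes are preserved after the length change.
    return y * (n_out / n_in)


def resample_to_spacing(slc, native_spacing, target_spacing=0.8):
    """Resample a 2-D complex image whose (row, col) pixel spacings are
    given in meters onto a grid with `target_spacing` meters per pixel."""
    for axis, spacing in enumerate(native_spacing):
        n_out = int(round(slc.shape[axis] * spacing / target_spacing))
        slc = fft_resample_axis(slc, n_out, axis)
    return slc


def tile(img, size=1024):
    """Yield ((row, col), patch) for non-overlapping size x size patches.
    Assumption: partial tiles at the image edges are dropped."""
    rows, cols = img.shape[:2]
    for r in range(0, rows - size + 1, size):
        for c in range(0, cols - size + 1, size):
            yield (r, c), img[r:r + size, c:c + size]
```

Downsampling crops the spectrum, which discards the frequencies the coarser grid cannot represent and so avoids aliasing. Upsampling zero-pads it, which interpolates without adding information. Real SICD data may have a spectrum that is offset from zero frequency, in which case it would need to be recentered before a step like this.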
