UniverSat: Resolution- and Modality-Agnostic Transformers for Earth Observation

2026-06-22 • Computer Vision and Pattern Recognition

AI summary

The authors present UniverSat, a new type of Vision Transformer designed for Earth Observation data, which often comes from different sensors and has varied resolutions. Instead of using fixed patch projectors, their Universal Patch Encoder can handle patches from many kinds of data in a unified way. This allows training one model on mixed data types using self-supervised learning. They show strong results on several standard Earth Observation tasks like classification and segmentation.

Vision Transformer, Earth Observation, patch encoder, self-supervised learning, multimodal data, classification, segmentation, GeoBench, PANGEABench, SpectralEarth

Authors

Yohann Perron, Guillaume Astruc, Nicolas Gonthier, Clement Mallet, Loic Landrieu

Abstract

Vision Transformers (ViT) dominate computer vision. However, their reliance on rigid patch projectors hinders transfer to Earth Observation (EO), where input modalities, scales, and resolutions vary widely. We introduce UniverSat, a ViT-style backbone built around a Universal Patch Encoder that maps patches from arbitrary spatial, spectral, and temporal resolutions, and from both optical and non-optical sensors, into a shared embedding space with a shared set of weights. This enables training a single model on heterogeneous multimodal corpora via self-supervision, yielding robust, sensor-agnostic spatial features. We validate this approach with strong results across classification and segmentation on standard EO benchmarks from GeoBench, PANGEABench, and SpectralEarth. Our code and models are available at https://github.com/gastruc/UniverSat.
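
To make the idea of one set of weights serving many sensors concrete, here is a minimal NumPy sketch. It is not the authors' Universal Patch Encoder, whose internals the abstract does not describe. It assumes one simple way to get an input-agnostic projection: pool each channel to a fixed grid, apply a single shared linear map per channel, and average over channels. The names `adaptive_avg_pool`, `encode_patch`, `weight`, and `grid` are hypothetical, and the temporal dimension is ignored.

```python
import numpy as np

def adaptive_avg_pool(channel, grid):
    """Average-pool a 2D array (H, W) to (grid, grid); requires H, W >= grid."""
    h, w = channel.shape
    rows = np.linspace(0, h, grid + 1).astype(int)
    cols = np.linspace(0, w, grid + 1).astype(int)
    out = np.empty((grid, grid))
    for i in range(grid):
        for j in range(grid):
            out[i, j] = channel[rows[i]:rows[i + 1], cols[j]:cols[j + 1]].mean()
    return out

def encode_patch(patch, weight, grid=4):
    """Illustrative stand-in for an input-agnostic patch encoder (an assumption).

    patch:  array of shape (C, H, W) with any channel count and patch size.
    weight: array of shape (grid * grid, D), shared by every sensor.
    Returns an embedding of shape (D,).
    """
    # One fixed-length vector per channel, whatever the spatial size.
    tokens = np.stack([adaptive_avg_pool(c, grid).ravel() for c in patch])
    # The same projection is applied to every channel of every sensor.
    per_channel = tokens @ weight
    # Averaging over channels removes the dependence on the channel count.
    return per_channel.mean(axis=0)

rng = np.random.default_rng(0)
weight = rng.normal(size=(16, 8))          # shared weights, D = 8
optical = rng.normal(size=(12, 16, 16))    # e.g. a 12-band optical patch
radar = rng.normal(size=(2, 8, 8))         # e.g. a 2-channel radar patch
optical_embedding = encode_patch(optical, weight)
radar_embedding = encode_patch(radar, weight)
```

Both embeddings have the same length even though the inputs differ in channel count and patch size, which is the property that lets a single backbone consume mixed data. A fixed ViT patch projector, by contrast, would need a separate weight matrix for each input shape.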
