Explainable Flood Segmentation on Sentinel-1 SAR Imagery: A Comparative Study of CNN and Transformer Architectures

2026-06-15Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition
AI summary

The authors compare different types of computer models that look at satellite radar images to identify flooded land, permanent water, and dry land. They test three popular convolutional neural networks and three vision transformer models to see which predicts floods more accurately. Their results show that one vision transformer model, SegFormer-b2, generally does better than a common CNN model, especially in certain flood types, though the difference becomes smaller after extra training on another dataset. They also study how these models make decisions and how confident they are in their predictions. Overall, the authors highlight strengths and weaknesses in both model types for flood mapping.

Synthetic Aperture Radar (SAR)Flood segmentationConvolutional Neural Network (CNN)Vision TransformerU-NetSegFormerSentinel-1Flood IoUGrad-CAMSpatial generalization
Authors
Arundhuti Banerjee, David Daou
Abstract
Rapid and accurate flood prediction is essential for disaster response and mitigation planning. Synthetic Aperture Radar (SAR) sensors in satellites are well-suited for this purpose because they operate independently of weather and daylight conditions. Although SAR-based data enable all-weather flood monitoring, distinguishing flooded land from permanent water remains a significant challenge, particularly when flooding is defined strictly as inundated land. This study provides a comprehensive comparison of convolutional neural network (CNN) and vision transformer architectures for multi-class flood segmentation using Sentinel-1 SAR imagery, specifically trained to separate flooded land from permanent water bodies and land. Three state-of-the-art (SOTA)CNN-based models, U-Net, U-Net++, and DeepLabV3 with ResNet-34 backbone, and three SegFormer variants (b0,b1,b2) were evaluated in two benchmark datasets, the ETCI NASA dataset and SenFloods11, using scene-based data splits to ensure a realistic assessment of spatial generalization. The results demonstrate that SegFormer-b2 significantly outperforms the U-Net baseline on the ETCI dataset (higher flood IoU across all 7 test scenes in the Wilcoxon signed-rank test), while after fine-tuning on Sen1Floods11, the advantage narrows to within the range of scene variability and is concentrated in spatially fragmented flood events. The study includes both qualitative and quantitative explainability techniques to visually comprehend model decisions and systematically assess prediction reliability. Qualitative analysis reveals that SegFormer-b2 produces more spatially coherent Grad-CAM activations focused on flood-relevant features, while U-Net generates more informative uncertainty estimates along flood boundaries.