ATV-Net: Adaptive Triple-View Network with Dynamic Feature Fusion

2026-05-25 · Computer Vision and Pattern Recognition

AI summary

The authors looked at improving a classic image segmentation model based on ResNet-101 by only changing its final part, called the segmentation head. They created ATV-Net, which uses three different ways to look at the image: tiny details, nearby areas, and bigger context. Instead of mixing these views equally, their model decides which view is most important depending on the scene. On the Cityscapes validation set, this simpler approach reaches 80.31% mIoU, challenging the idea that newer, more complex transformer models are always better.

semantic segmentation, ResNet-101, segmentation head, receptive field, adaptive fusion, CNN, mIoU, Cityscapes dataset, transformers
Authors
Hsin-Jui Pan, Sheng-Wei Chan, Meng-Qian Li, Chun-Po Shen
Abstract
Recent semantic segmentation research has increasingly moved toward stronger context modeling, dense attention, and transformer-based architectures. Although these models achieve impressive performance, classical CNN-based segmentation pipelines remain attractive because of their simplicity, efficiency, and ease of implementation. This paper revisits a practical question: how far can a ResNet-based segmentation model be improved by only modifying the segmentation head? We propose ATV-Net, an Adaptive Triple-View Network that strengthens a ResNet-101 backbone using three simple but complementary receptive-field views. The micro view captures point-wise semantic responses, the local view models neighborhood structures and object boundaries, and the scout view provides enlarged contextual cues. Instead of fusing these views with fixed weights, ATV-Net introduces an Adaptive Decision Gate that dynamically selects receptive-field responses according to input scene characteristics. A compact global coordination layer is further applied to improve spatial and semantic consistency. Experiments on the Cityscapes validation set show that ATV-Net achieves 80.31% mIoU. This result suggests that classical CNN-based segmentation is still far from obsolete: with simple receptive-field views and adaptive fusion, a ResNet-based pipeline can reach a competitive accuracy level without relying on transformer-style global attention or overly complex context modules.
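
To make the idea of input-dependent fusion concrete, the following is a minimal sketch of how three view responses could be combined with weights that depend on the input rather than being fixed. The abstract does not specify the internals of the Adaptive Decision Gate, so everything here is an assumption for illustration: the gate is modeled as a softmax over a linear function of each view's global average response, the maps are single-channel, and the names `adaptive_fuse`, `gate_weights`, and `gate_bias` are hypothetical. The actual ATV-Net gate may differ, for example by operating per pixel or per channel.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def global_average(feature_map):
    """Mean of all values in an H x W map (a crude scene descriptor)."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def adaptive_fuse(views, gate_weights, gate_bias):
    """Fuse equally sized H x W view maps with input-dependent weights.

    ASSUMPTION: the gate is a linear layer plus softmax applied to the
    global average of each view. gate_weights (K x K) and gate_bias (K)
    stand in for learned parameters; they are hypothetical.
    """
    descriptor = [global_average(v) for v in views]
    logits = [
        sum(w * d for w, d in zip(row, descriptor)) + b
        for row, b in zip(gate_weights, gate_bias)
    ]
    weights = softmax(logits)  # one weight per view, summing to 1

    height, width = len(views[0]), len(views[0][0])
    fused = [
        [sum(wk * view[i][j] for wk, view in zip(weights, views))
         for j in range(width)]
        for i in range(height)
    ]
    return fused, weights


# Toy 2 x 2 responses standing in for the micro, local, and scout views.
micro = [[1.0, 2.0], [3.0, 4.0]]
local = [[0.5, 0.5], [0.5, 0.5]]
scout = [[2.0, 2.0], [2.0, 2.0]]

identity_gate = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zero_bias = [0.0, 0.0, 0.0]

fused, weights = adaptive_fuse([micro, local, scout], identity_gate, zero_bias)
```

With the identity gate used above, the view with the strongest average response receives the largest weight, so the fusion changes as the input changes. Setting all gate parameters to zero makes the softmax uniform and recovers the fixed, equal-weight fusion that the abstract contrasts against.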