Efficient Hybrid CNN-GNN Architecture for Monocular Depth Estimation
2026-05-11 • Computer Vision and Pattern Recognition
AI summary
The authors introduce GraphDepth, a method to estimate depth from a single image by combining graph neural networks with traditional convolutional neural networks. They use graphs to better understand long-range relationships in images, which helps improve depth prediction without the heavy computation that transformer models require. Their approach is faster and uses less memory while achieving accuracy close to top transformer-based methods. They also include ways to handle uncertainty in predictions and show their model performs well on various indoor and aerial datasets, even when tested on data different from what it was trained on.
Monocular Depth Estimation · Graph Neural Networks · Convolutional Neural Networks · ResNet · U-Net · GraphSAGE · k-Nearest Neighbors (k-NN) · Aleatoric Uncertainty · Transformer Models · Cross-Domain Generalization
Authors
Ishan Narayan
Abstract
We present GraphDepth, a monocular depth estimation architecture that synergistically integrates Graph Neural Networks (GNNs) within a convolutional encoder-decoder framework. Our approach embeds efficient GraphSAGE layers at multiple scales of a ResNet-101 U-Net backbone, enabling explicit modeling of long-range spatial relationships that lie beyond the receptive field of local convolutions. Key technical contributions include: (1) batch-parallelized graph construction with configurable k-NN and grid-based adjacency for scalable training; (2) multi-scale GraphSAGE integration at bottleneck and decoder stages (1/32, 1/16, 1/8 resolution) to propagate global context throughout the feature hierarchy; (3) channel-attention gated skip connections that adaptively weight encoder features before fusion; and (4) heteroscedastic uncertainty estimation via a dedicated aleatoric uncertainty head, enabling confidence-aware loss weighting during optimization. Unlike transformer-based hybrids, which suffer from quadratic complexity in sequence length, GraphDepth scales linearly with spatial resolution while achieving comparable global receptive fields through iterative message passing. Experiments on NYU Depth V2, WHU Aerial, ETH3D, and Mid-Air benchmarks demonstrate competitive accuracy within 4.6% of state-of-the-art transformers on indoor scenes with substantially lower computational cost (25 FPS vs. 9 FPS, 3.8 GB vs. 8.8 GB VRAM). GraphDepth achieves the best reported result on WHU Aerial (RMSE 8.24 m) and exhibits superior zero-shot cross-domain transfer to the Mid-Air synthetic aerial dataset, validating the generalization power of explicit relational reasoning for depth estimation.
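The abstract names three core mechanisms without giving their formulas: k-NN graph construction over feature nodes, GraphSAGE message passing, and heteroscedastic (aleatoric) loss weighting. A minimal NumPy sketch of how such components are commonly realized might look like the following; the function names, the mean-aggregator GraphSAGE variant, and the Kendall-and-Gal-style L1 loss form are assumptions for illustration, not details taken from the paper:

```python
import numpy as np

def knn_adjacency(feats, k):
    """k-NN graph over N node features (N, C): returns (N, k) neighbor indices.
    Nodes would be flattened spatial positions of a CNN feature map
    (e.g. 16x16 = 256 nodes at 1/32 resolution of a 512x512 input)."""
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)           # exclude self from its own neighbors
    return np.argsort(d2, axis=1)[:, :k]

def graphsage_layer(feats, nbrs, w_self, w_neigh):
    """GraphSAGE with mean aggregator: h_i' = ReLU(W_s h_i + W_n mean_j h_j).
    Iterating such layers widens the effective receptive field per hop."""
    agg = feats[nbrs].mean(axis=1)         # (N, C): mean over the k neighbors
    return np.maximum(feats @ w_self + agg @ w_neigh, 0.0)

def aleatoric_l1(pred, target, log_var):
    """Heteroscedastic L1 loss: residuals down-weighted where the predicted
    log-variance (from an uncertainty head) is high, plus a log-variance
    penalty that stops the model from claiming infinite uncertainty."""
    return np.mean(np.abs(pred - target) * np.exp(-log_var) + log_var)
```

Note the linear scaling the abstract claims: with a fixed k, each GraphSAGE layer touches N·k edges, so cost grows linearly in the number of spatial positions N, versus the N² pairwise attention of a transformer block (the O(N²) distance matrix above is only for clarity; batched or partition-based k-NN avoids it in practice).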