Efficient RGB-T Object Detection via Sparse Cross-Modality Fusion
2026-06-29 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary
The authors recognized that most parts of RGB-T images are simple backgrounds that don't need complicated processing. They designed a two-step method where a fast, simple detector first spots possible objects, then a more detailed fusion process looks closely at these spots to improve accuracy. This way, their approach saves computing power by focusing effort only where it's needed. Their experiments showed this method works well while being more efficient and scalable for big images.
RGB-T detection, thermal infrared, visible light, feature fusion, region of interest (RoI), two-stage detection, computational efficiency, foreground-background separation, object detection, high-resolution images
Authors
Chao Tian, Zikun Zhou, Chao Yang, Guoqing Zhu, Zhenyu He
Abstract
RGB-T detectors leverage the complementary strengths of visible and thermal infrared modalities, achieving robust performance under challenging conditions. However, many of them resort to heavy dual backbones and exhaustive cross-modality fusion across the entire image, leading to impractically high computational costs. We observe that most image regions are smooth backgrounds (e.g., sky, ground) that can be easily handled by lightweight single-modality models. In light of this observation, we propose a sparse fusion mechanism for efficient RGB-T detection: first rapidly scanning the image to identify candidate proposals, and then carefully examining only these sparse proposals via feature fusion. We instantiate this mechanism as a two-stage framework: 1) a lightweight, modality-specific detection stage that produces high-recall RoIs, and 2) a fusion-driven examination and refinement stage that filters out false positives and refines the bounding boxes. This design enables the detector to adaptively allocate more computational resources to potential foregrounds, improving efficiency while preserving detection accuracy. Extensive experiments show that our method achieves competitive performance with substantially fewer parameters and lower computational cost, while maintaining strong scalability to high-resolution images.
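The control flow of the sparse fusion mechanism described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`Detection`, `union_proposals`, `fused_score`, `sparse_fusion_detect`, `keep_threshold`) is hypothetical, the "fusion" is a plain average of the two modalities standing in for a learned fusion module, and bounding-box refinement is omitted. It only shows that the costly cross-modality step runs on the proposed RoIs rather than on the whole image.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]          # (x0, y0, x1, y1), end-exclusive
Image = Sequence[Sequence[float]]        # toy single-channel image, rows of pixels


@dataclass
class Detection:
    box: Box
    score: float


def union_proposals(rgb_props: Sequence[Detection],
                    thermal_props: Sequence[Detection]) -> List[Detection]:
    """Stage 1 output: keep boxes from either modality to favor recall."""
    seen, merged = set(), []
    for det in list(rgb_props) + list(thermal_props):
        if det.box not in seen:          # drop exact duplicates only
            seen.add(det.box)
            merged.append(det)
    return merged


def crop(image: Image, box: Box) -> List[List[float]]:
    x0, y0, x1, y1 = box
    return [list(row[x0:x1]) for row in image[y0:y1]]


def fused_score(rgb_patch: Image, thermal_patch: Image) -> float:
    """Placeholder fusion (an assumption): mean of the per-pixel average."""
    total, count = 0.0, 0
    for r_row, t_row in zip(rgb_patch, thermal_patch):
        for r, t in zip(r_row, t_row):
            total += 0.5 * (r + t)
            count += 1
    return total / count if count else 0.0


def sparse_fusion_detect(rgb: Image, thermal: Image,
                         rgb_detector: Callable[[Image], List[Detection]],
                         thermal_detector: Callable[[Image], List[Detection]],
                         keep_threshold: float = 0.5) -> List[Detection]:
    # Stage 1: cheap, modality-specific detectors scan the full image.
    proposals = union_proposals(rgb_detector(rgb), thermal_detector(thermal))
    # Stage 2: fusion runs only inside each RoI and filters false positives.
    kept = []
    for det in proposals:
        score = fused_score(crop(rgb, det.box), crop(thermal, det.box))
        if score >= keep_threshold:
            kept.append(Detection(det.box, score))
    return kept


# Toy 4x4 inputs: a real object at top-left, an RGB-only false alarm at bottom-right.
rgb = [[1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 0.6, 0.6], [0.0, 0.0, 0.6, 0.6]]
thermal = [[0.8, 0.8, 0.0, 0.0], [0.8, 0.8, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
kept = sparse_fusion_detect(
    rgb, thermal,
    rgb_detector=lambda img: [Detection((0, 0, 2, 2), 0.9), Detection((2, 2, 4, 4), 0.6)],
    thermal_detector=lambda img: [Detection((0, 0, 2, 2), 0.8)],
)
```

In this toy case the fused step touches two 2x2 patches instead of the full 4x4 image, and the bottom-right proposal is rejected because the thermal patch does not support it. In a real detector the stage-1 models and the fusion module would be learned networks, and the share of the image covered by RoIs would determine how much computation is saved.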