Rethinking Prototype-based Similarity Learning for Few-Shot Object Detection

2026-06-22, Computer Vision and Pattern Recognition

AI summary

The authors study few-shot object detection: teaching a detector to recognize and localize new object categories from only a few labeled examples. They find that existing prototype-based methods often confuse similar classes and lack the spatial detail needed for precise localization. To address this, they propose two components: TSMa, which uses class-level text features to better separate classes, and SHARe, which improves localization by refining bounding boxes in stages using different layers of a Vision Transformer. On COCO, their approach outperforms the previous best by +10.1 nAP.

few-shot object detection, prototype-based similarity, class confusion, semantic mask, text features, Vision Transformer (ViT), hierarchical regression, bounding box localization, COCO dataset
Authors
KunHo Heo, Seungjae Kim, Wongyu Lee, SuYeon Kim, MyeongAh Cho
Abstract
Few-shot object detection aims to detect novel object categories from only a few labeled examples, avoiding costly large-scale annotation. Recent prototype-based similarity learning approaches enable training-free adaptation by matching query features with class prototypes. However, they suffer from two fundamental limitations: (i) class confusion arising from inter-class similarity margin collapse, and (ii) insufficient visual cues for precise localization, as similarity scores capture only class-level semantic affinity while providing limited spatial information. To address these issues, we introduce two complementary components. Text-Anchored Semantic Mask (TSMa) leverages class-level text features as semantic anchors to identify semantically aligned channels through channel-wise interaction between visual and text features. By suppressing style-induced spurious responses and emphasizing class-intrinsic signals, TSMa enlarges inter-class similarity margins and mitigates class confusion. We further propose Stage-Aligned Hierarchical Autoregressive Regression (SHARe), which reformulates localization as a hierarchical autoregressive process that progressively refines bounding boxes across multiple stages. SHARe leverages the layer-wise characteristics of ViT representations by aligning feature abstraction levels with regression stages: deeper layers guide early coarse localization, while shallower layers rich in edge and texture cues refine spatial details in later stages. Experiments on COCO demonstrate a new state of the art, outperforming the previous best by +10.1 nAP, with extensive analysis validating each component. The code is available at https://github.com/VisualScienceLab-KHU/ReSet.
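
To make the margin-collapse argument concrete, the following is a minimal illustrative sketch of prototype-based cosine matching with an optional channel mask. It is not the authors' implementation: the toy 4-channel features, the function names, and the hand-set binary mask are all assumptions made for illustration. In particular, how TSMa derives its mask from text features is not reproduced here; the mask below is simply given.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def class_prototypes(support_feats, support_labels, num_classes):
    """Average the support features of each class into one prototype."""
    return np.stack([
        support_feats[support_labels == c].mean(axis=0)
        for c in range(num_classes)
    ])

def cosine_scores(query, prototypes, mask=None):
    """Cosine similarity between one query feature and every prototype.

    `mask` is a hypothetical per-channel weight vector; when given, it is
    applied to both the query and the prototypes before matching.
    """
    if mask is not None:
        query = query * mask
        prototypes = prototypes * mask
    return l2_normalize(prototypes) @ l2_normalize(query)

def top2_margin(scores):
    """Gap between the best and second-best class similarity."""
    second, best = np.sort(scores)[-2:]
    return best - second

# Toy data (assumed): channels 0-1 carry class-specific signal,
# channels 2-3 carry a strong response shared by every class ("style").
support_feats = np.array([
    [1.2, 0.0, 3.0, 3.0],
    [0.8, 0.0, 3.0, 3.0],
    [0.0, 1.1, 3.0, 3.0],
    [0.0, 0.9, 3.0, 3.0],
])
support_labels = np.array([0, 0, 1, 1])
prototypes = class_prototypes(support_feats, support_labels, num_classes=2)

query = np.array([0.9, 0.1, 3.0, 3.0])   # a class-0-like query feature
mask = np.array([1.0, 1.0, 0.0, 0.0])    # hand-set stand-in for a semantic mask

plain_scores = cosine_scores(query, prototypes)
masked_scores = cosine_scores(query, prototypes, mask=mask)

print("margin without mask:", top2_margin(plain_scores))
print("margin with mask:   ", top2_margin(masked_scores))
```

In this toy setting both variants rank class 0 first, but the shared channels dominate the unmasked cosine scores and push the two class similarities close together; suppressing those channels widens the gap. This mirrors, in a very simplified form, the motivation the abstract gives for masking out style-induced responses.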