Training-Free Generalized Few-Shot Segmentation through Open-Vocabulary Semantic Arbitration
2026-06-08 • Computer Vision and Pattern Recognition
AI summary
The authors explore a way to do generalized few-shot semantic segmentation (GFSS) without training new models, using existing foundation models like SAM3 and CLIP. They propose Open-V, a method that combines these models at inference time to recognize and segment new categories from just a few examples. Open-V requires no additional training and is evaluated on PASCAL-5i, COCO-20i, and ADE-OW. The authors also show that preprocessing and evaluation-space mismatches can distort reported results in foundation-model segmentation, and that support examples help most when the models’ built-in text knowledge is weaker on new labels.
Generalized Few-Shot Semantic Segmentation, Foundation Models, Segment Anything Model (SAM), CLIP, Open-Vocabulary Recognition, Inference-time Coordination, Semantic Priors, mIoU, PASCAL-5i, Training-Free Methods
Authors
Silas Kwabla Gah, Ebenezer Owusu
Abstract
Generalized Few-Shot Semantic Segmentation (GFSS) has traditionally been approached as a representation-learning problem, requiring task-specific adaptation to incorporate novel classes from limited support examples. Recent foundation models, however, already exhibit strong open-vocabulary recognition and segmentation capabilities, raising a different question: can GFSS be solved through inference-time coordination of frozen semantic priors rather than parameter adaptation? We answer this question with Open-V, a training-free GFSS framework that combines Segment Anything (SAM3) Promptable Concept Segmentation (PCS) with a K-shot CLIP support centroid through calibrated per-pixel semantic arbitration. Open-V introduces no trainable components and supports arbitrary semantic categories at inference time. Beyond segmentation performance, our study contributes three broader findings. First, we show that support information can be incorporated through inference-time semantic grounding, and that its contribution increases as foundation-model text priors weaken on label-disjoint vocabularies. Second, we identify a reproducibility confound in foundation-model segmentation, demonstrating that preprocessing and evaluation-space mismatches can silently distort reported performance. Finally, we validate Open-V across PASCAL-5i, COCO-20i, and ADE-OW, showing that training-free coordination of foundation-model priors generalizes across both conventional GFSS and open-vocabulary evaluation settings. On PASCAL-5i (1-shot), Open-V attains base/novel/harmonic mIoU of 78.4/77.5/77.9 without GFSS-specific training, surpassing the strongest trained baseline by +17.7 HM.
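As a reading aid for the reported base/novel/harmonic figures, the following minimal sketch assumes that HM denotes the harmonic mean of base-class and novel-class mIoU, as is conventional in GFSS evaluation; the abstract does not spell out the formula. The function name `harmonic_miou` is hypothetical and is not taken from the authors' code.

```python
def harmonic_miou(base_miou: float, novel_miou: float) -> float:
    """Harmonic mean of base and novel mIoU (assumed definition of HM)."""
    total = base_miou + novel_miou
    if total == 0:
        # Both scores are zero, so the harmonic mean is defined as zero here.
        return 0.0
    return 2.0 * base_miou * novel_miou / total


# PASCAL-5i (1-shot) base and novel mIoU values quoted in the abstract.
hm = harmonic_miou(78.4, 77.5)
print(f"{hm:.1f}")
```

Under this assumption the base and novel scores of 78.4 and 77.5 give a value that rounds to the reported 77.9. The harmonic mean penalizes imbalance between the two scores, so a method cannot reach a high HM by doing well on base classes alone.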