OV-Stitcher: A Global Context-Aware Framework for Training-Free Open-Vocabulary Semantic Segmentation

2026-04-09 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

AI summary

The authors focus on training-free open-vocabulary semantic segmentation, which labels the objects in an image without any additional training. Current methods split a large image into smaller crops and process each crop separately, so the model never sees the whole image at once. The authors propose OV-Stitcher, which combines the features from these crops inside the model's final encoder block so that it can reason over the full image. This produces more coherent and accurate segmentation maps: across eight benchmarks, mean Intersection over Union (mIoU) rises from 48.7 to 50.7.

open-vocabulary semantic segmentation, training-free, pretrained encoders, sliding-window strategy, global attention, feature representation, contextual reasoning, mean Intersection over Union (mIoU), vision-language models

Authors

Seungjae Moon, Seunghyun Oh, Youngmin Ro

Abstract

Training-free open-vocabulary semantic segmentation (TF-OVSS) has recently attracted attention for its ability to perform dense prediction by leveraging the pretrained knowledge of large vision and vision-language models, without requiring additional training. However, due to the limited input resolution of these pretrained encoders, existing TF-OVSS methods commonly adopt a sliding-window strategy that processes cropped sub-images independently. While effective for managing high-resolution inputs, this approach prevents global attention over the full image, leading to fragmented feature representations and limited contextual reasoning. We propose OV-Stitcher, a training-free framework that addresses this limitation by stitching fragmented sub-image features directly within the final encoder block. By reconstructing attention representations from fragmented sub-image features, OV-Stitcher enables global attention within the final encoder block, producing coherent context aggregation and spatially consistent, semantically aligned segmentation maps. Extensive evaluations across eight benchmarks demonstrate that OV-Stitcher establishes a scalable and effective solution for open-vocabulary segmentation, achieving a notable improvement in mean Intersection over Union (mIoU) from 48.7 to 50.7 compared with prior training-free baselines.
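
The contrast between per-window attention and attention over stitched features can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes non-overlapping windows, a single attention head, and random query/key/value tensors, and every function name in it (`attention`, `windowed_attention`, `stitched_attention`) is hypothetical. The abstract does not specify how overlapping windows or positional information are handled, so the sketch leaves those out.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Single-head scaled dot-product attention on (tokens, dim) matrices."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def windowed_attention(q, k, v):
    """Sliding-window baseline: each window attends only to its own tokens.

    Inputs have shape (windows, tokens_per_window, dim).
    """
    return np.stack([attention(qw, kw, vw) for qw, kw, vw in zip(q, k, v)])


def stitched_attention(q, k, v):
    """Illustrative stitching: pool tokens from all windows, then attend globally."""
    w, n, d = q.shape
    out = attention(
        q.reshape(w * n, d), k.reshape(w * n, d), v.reshape(w * n, d)
    )
    return out.reshape(w, n, d)


# Assumed toy sizes; real encoders use far more tokens and dimensions.
rng = np.random.default_rng(0)
num_windows, tokens_per_window, dim = 4, 16, 32
q = rng.standard_normal((num_windows, tokens_per_window, dim))
k = rng.standard_normal((num_windows, tokens_per_window, dim))
v = rng.standard_normal((num_windows, tokens_per_window, dim))

local_out = windowed_attention(q, k, v)
global_out = stitched_attention(q, k, v)
print(local_out.shape, global_out.shape)
```

Both outputs have the same shape, but in the stitched version each token's output is a weighted sum over the values of every window rather than only its own, which is the kind of cross-window context the sliding-window baseline cannot provide.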
