CR-Seg: Attention-Guided and CoT-Enhanced Coarse-to-Refined Reasoning Segmentation

2026-06-02

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors study a way to separate objects in images based on complicated descriptions by combining visual and language understanding. They point out problems with existing methods that either struggle to connect language and images well or lose important spatial details. To fix this, they create a two-step system called CR-Seg that first roughly finds objects using attention maps and then refines the segmentation using selected points. They also introduce a new reasoning method that helps the model think from the whole image down to specific parts. Their experiments show that this approach works better on standard tests for reasoning-based segmentation.

reasoning segmentation, multimodal large language models, cross-modal alignment, attention maps, segment anything model (SAM), chain-of-thought reasoning, coarse-to-refined segmentation, global-to-local reasoning, visual-textual reasoning
Authors
Yifan Cao, Xiaocui Yang, Faxian Wan, Shi Feng, Daling Wang, Yifei Zhang
Abstract
Reasoning segmentation aims to segment target objects described by complex language through joint visual-textual reasoning. Existing methods typically rely either on learned semantic tokens to bridge Multimodal Large Language Models (MLLMs) and segmentation models, which makes cross-modal alignment difficult, or on explicit spatial prompts such as bounding boxes, which may lose holistic response semantics. To address these limitations, we propose Attention-Guided and CoT-Enhanced Coarse-to-Refined Reasoning Segmentation (CR-Seg), a two-stage framework. Specifically, we design an Extract Attention Maps and Points (EAP) module that extracts attention maps for coarse target localization and selects informative points; both are fed into SAM for mask refinement. To alleviate reasoning-answer inconsistency, we further introduce Global-to-Local Chain-of-Thought (GLCoT), which guides the model to reason progressively from global scene context to local target details. Extensive experiments on reasoning segmentation benchmarks demonstrate the effectiveness of CR-Seg.
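The coarse stage described above (an attention map for rough localization plus a few informative points) can be illustrated with a small sketch. This is not the authors' EAP module: the abstract does not say how points are chosen, so the greedy peak selection with a minimum spacing, the min-max thresholding, and the names `select_points`, `coarse_mask`, `k`, `min_dist`, and `thresh` are all assumptions made for illustration. The toy `attn` grid stands in for an attention map that would come from an MLLM.

```python
from typing import List, Tuple

Grid = List[List[float]]


def select_points(attn: Grid, k: int = 3, min_dist: int = 2) -> List[Tuple[int, int]]:
    """Greedily pick up to k high-attention (row, col) cells.

    Assumption: "informative points" are taken to be attention peaks that lie
    at least min_dist cells apart (Chebyshev distance).
    """
    # Flatten the grid and visit cells from highest to lowest attention.
    cells = sorted(
        ((v, r, c) for r, row in enumerate(attn) for c, v in enumerate(row)),
        key=lambda t: -t[0],
    )
    picked: List[Tuple[int, int]] = []
    for _, r, c in cells:
        # Skip cells that sit too close to an already selected point.
        if all(max(abs(r - pr), abs(c - pc)) >= min_dist for pr, pc in picked):
            picked.append((r, c))
            if len(picked) == k:
                break
    return picked


def coarse_mask(attn: Grid, thresh: float = 0.5) -> List[List[bool]]:
    """Min-max normalise the map and threshold it into a rough binary mask."""
    lo = min(min(row) for row in attn)
    hi = max(max(row) for row in attn)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant map
    return [[(v - lo) / span >= thresh for v in row] for row in attn]


# Toy 4x4 attention map (hypothetical values).
attn = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.6, 0.0],
    [0.0, 0.5, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.8],
]

points = select_points(attn, k=2, min_dist=2)
mask = coarse_mask(attn, thresh=0.5)
# Promptable segmenters usually expect (x, y) order, so swap (row, col).
points_xy = [(c, r) for r, c in points]
```

In a full pipeline, `points_xy` and a resized version of `mask` would serve as the point and mask prompts for a promptable segmenter such as SAM, which then produces the refined mask.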