Concept Alignment Contrast and Long-Short Prompt Memory for Test-Time Adaptation of SAM3 in Medical Image Segmentation
2026-06-22 • Computer Vision and Pattern Recognition
AI summary
The authors tackle the problem that a popular image segmentation model (SAM3) works well on natural images but struggles with medical images because they look very different. They propose a method called CM-TTA that improves the model's performance at test time without needing extra labels. It scores several augmented versions of each test image by how well the prediction agrees with the text concept and learns from the best one, and it combines a short-term prompt memory for quick adjustment with a long-term prompt memory for stability. Their experiments on prostate and skin lesion segmentation show that this approach works better than existing test-time adaptation methods for SAM3.
Segment Anything Model 3 (SAM3) • Test-Time Adaptation (TTA) • Medical Image Segmentation • Concept Alignment Contrast (CAC) • Long-Short Prompt Memory (LSPM) • Densely Supervised Prompt Update (DSPU) • Vision-Language Models • Pseudo-labeling • Domain Gap • Semantic Consistency
Authors
Yubo Zhou, Jianghao Wu, Ping Ye, Shaoting Zhang, Guotai Wang
Abstract
Concept segmentation models like Segment Anything Model 3 (SAM3) show strong generalization on natural images, yet their performance degrades in medical imaging due to the domain gap caused by different imaging principles and styles. Test-Time Adaptation (TTA) is essential for improving test performance by updating the model on the fly without annotations. However, existing vision-language TTA methods are mainly driven by image-level uncertainty minimization, which does not necessarily reflect region-level semantic correctness in medical segmentation. Moreover, they often lack mechanisms to maintain stability in continual one-pass adaptation, leading to limited performance when reliable dense supervision is missing for segmentation. To address these issues, we propose Concept Alignment Contrast and Long-Short Prompt Memory for Test-Time Adaptation (CM-TTA) of SAM3 for medical images. First, for a test sample with multiple augmentations, we introduce a novel Concept Alignment Contrast (CAC) metric, which leverages textual-visual semantic consistency to robustly evaluate prediction quality and select the best augmented view as supervision. Second, to balance rapid and stable adaptation, we design a Long-Short Prompt Memory (LSPM) module. The short memory dynamically fuses recent prompts based on CAC scores for agile local adaptation, while the long memory maintains a stable global prompt to generate enhanced pseudo-labels. Finally, a Densely Supervised Prompt Update (DSPU) strategy is proposed to optimize the prompt embeddings with the enhanced pseudo-labels as dense supervision. Extensive experiments on prostate and skin lesion segmentation demonstrate that our CM-TTA framework significantly outperforms existing methods for TTA of SAM3.
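The view selection and the two-level prompt memory described in the abstract can be illustrated with a small sketch. The abstract does not give the CAC formula, the fusion rule, or the long-memory update, so everything concrete here is an assumption made for illustration: the scores are taken as given numbers, the short memory fuses recent prompts with softmax weights over their scores, and the long memory is an exponential moving average. The names `select_best_view`, `LongShortPromptMemory`, `short_size`, `momentum`, and `temperature` are hypothetical and are not taken from the paper or from SAM3.

```python
import math
from collections import deque


def select_best_view(scores):
    """Return the index of the augmented view with the highest quality score."""
    return max(range(len(scores)), key=lambda i: scores[i])


class LongShortPromptMemory:
    """Toy two-level prompt memory (illustrative assumptions, not the paper's rules)."""

    def __init__(self, dim, short_size=4, momentum=0.9, temperature=1.0):
        self.dim = dim
        self.short = deque(maxlen=short_size)  # recent (prompt, score) pairs
        self.long = None                       # slow-moving global prompt
        self.momentum = momentum               # assumed EMA coefficient
        self.temperature = temperature         # assumed softmax temperature

    def update(self, prompt, score):
        """Store a new prompt in the short memory and refresh the long memory."""
        prompt = [float(x) for x in prompt]
        if len(prompt) != self.dim:
            raise ValueError("prompt has the wrong dimension")
        self.short.append((prompt, float(score)))
        if self.long is None:
            self.long = list(prompt)
        else:
            m = self.momentum
            self.long = [m * old + (1.0 - m) * new
                         for old, new in zip(self.long, prompt)]

    def short_prompt(self):
        """Fuse recent prompts, weighting each by a softmax over its score."""
        if not self.short:
            raise ValueError("short memory is empty")
        scaled = [score / self.temperature for _, score in self.short]
        peak = max(scaled)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scaled]
        total = sum(weights)
        return [
            sum(w * prompt[i] for w, (prompt, _) in zip(weights, self.short)) / total
            for i in range(self.dim)
        ]


if __name__ == "__main__":
    view_scores = [0.2, 0.9, 0.5]  # made-up scores for three augmented views
    print("selected view:", select_best_view(view_scores))

    memory = LongShortPromptMemory(dim=2, momentum=0.5)
    memory.update([1.0, 0.0], score=0.3)
    memory.update([0.0, 1.0], score=0.8)
    print("short prompt:", memory.short_prompt())
    print("long prompt:", memory.long)
```

In this sketch the short memory reacts quickly because it only sees the last few prompts and favors the ones with higher scores, while the long memory changes slowly and would play the role of the stable global prompt used to produce pseudo-labels. The actual prompt embeddings, scoring, and update rules in CM-TTA may differ.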