Vision-Language Models as Zero-Annotation Oracles in Histopathology

2026-06-15
Computer Vision and Pattern Recognition

AI summary

The authors propose a method for separating tissue from background in pathology images, the first step in analyzing digitized slides. Instead of relying on specialized models that struggle with unusual stains, they use general-purpose vision-language models, trained on large collections of images and text from the internet, to identify tissue. The method performs best on difficult and rare stains, and supplying a small number of example images improves results on hard cases. The authors also show that the approach can review annotations about as well as human experts, and that its outputs can be used to train smaller, more efficient models for wider use. The aim is a more reliable and scalable tool for digital pathology workflows.

foreground segmentation, digital pathology, vision-language models, whole-slide images, stain variability, zero-shot learning, Dice coefficient, pseudo-labels, few-shot prompting, annotation review
Authors
Vishal Jain, Giorgio Buzzanca, Sarah Cechnicka, Maarten Naesens, Priyanka Koshy, Tri Nguyen, Jesper Kers, Candice Roufosse, Bernhard Kainz
Abstract
Foreground segmentation is the critical first step of every computational pathology pipeline, yet existing methods rely on hand-tuned heuristics or supervised models that overfit to narrow stain and scanner distributions, failing silently on specialised stains such as Jones silver or Elastica van Gieson (EVG). We propose a coarse-to-fine approach that recasts foreground segmentation as a visual perception task and leverages general-purpose vision-language models (VLMs) as zero-annotation oracles. Our key insight is that tissue-versus-background discrimination is a natural-image recognition problem, not a histopathological one, so VLMs trained on internet-scale corpora generalise where domain-specific models cannot. We introduce Leica-75, a benchmark of 75 renal transplant whole-slide images spanning three stain families. On Leica-75, our method achieves the highest segmentation quality on out-of-distribution stains (Dice 0.858 ± 0.027 on Jones, 0.853 ± 0.041 on EVG) with 7× lower cross-stain variance than the best supervised baseline, while remaining competitive on in-distribution H&E. Few-shot prompting with automatically curated exemplars (Auto-context) rescues hard cases on Stress-32 (n=32), a curated stress-test subset (Dice improves from 0.470 to 0.819 for the 2B model). VLM-based annotation review matches human expert consensus (κ = 0.989 for blur detection; mean precision/recall grading accuracy 0.708 vs. human 0.646 for segmentation mask review). The resulting pseudo-labels are used to distil lightweight student models that match the teacher model's performance at a fraction of the cost. Our framework provides a principled, scalable solution to a persistent infrastructure bottleneck in digital pathology.
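The abstract reports results in terms of the Dice coefficient and a kappa agreement statistic. As a reading aid, the following is a minimal sketch of the standard textbook definitions of both, not the authors' evaluation code. The function names `dice_coefficient` and `cohen_kappa` are hypothetical, the toy masks and labels are made up for illustration, and the use of Cohen's kappa specifically is an assumption, since the abstract does not say which kappa variant was used.

```python
from typing import Hashable, Sequence


def dice_coefficient(pred: Sequence[Sequence[int]],
                     target: Sequence[Sequence[int]]) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks of equal shape."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of rows")
    intersection = pred_area = target_area = 0
    for pred_row, target_row in zip(pred, target):
        if len(pred_row) != len(target_row):
            raise ValueError("mask rows must have the same length")
        for p, t in zip(pred_row, target_row):
            p, t = bool(p), bool(t)
            intersection += int(p and t)
            pred_area += int(p)
            target_area += int(t)
    if pred_area + target_area == 0:
        # Convention (an assumption here): two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / (pred_area + target_area)


def cohen_kappa(labels_a: Sequence[Hashable],
                labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa = (p_o - p_e) / (1 - p_e) for two raters."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Toy 2x3 tissue masks (1 = tissue, 0 = background), illustrative only.
    predicted = [[1, 1, 0], [0, 1, 0]]
    reference = [[1, 0, 0], [0, 1, 1]]
    print(round(dice_coefficient(predicted, reference), 3))  # 0.667

    # Toy blur-review labels from two reviewers, illustrative only.
    reviewer_1 = ["blur", "sharp", "blur", "sharp"]
    reviewer_2 = ["blur", "sharp", "sharp", "sharp"]
    print(cohen_kappa(reviewer_1, reviewer_2))  # 0.5
```

Dice rewards overlap relative to the combined size of the two masks, so it is sensitive both to missed tissue and to background marked as tissue. Kappa discounts the agreement two raters would reach by chance, which matters when one label (for example, "sharp") dominates.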