Interpretable Probabilistic Medical Image Segmentation via Gaussian Process with Explicit Modelling of Annotation Bias and Variability

2026-06-22

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors present a method that improves how deep learning segmentation models handle medical image annotations from multiple experts, whose labels can differ through systematic bias and variability. The method separates image-dependent information from annotator-specific differences, making it easier to analyse how those differences affect the model's predictions. On a multi-annotator dataset, the approach improves the model's uncertainty calibration while keeping segmentation accuracy comparable. The authors also show that the learned parameters reflect individual annotator behaviour, and that changing these parameters systematically changes the prediction results.

deep learning, medical image segmentation, multi-rater annotation, probabilistic modeling, Gaussian Process, uncertainty calibration, annotator bias, variability, logit space, variational inference
Authors
Qi Li, Yuliang Huang, Shaheer U. Saeed, Qianye Yang, Vasilis Stavrinides, Zachary M. C. Baum, Dean C. Barratt, J. Alison Noble, Tom Vercauteren, Yipeng Hu
Abstract
Deep learning-based medical image segmentation models are trained using annotations that exhibit systematic bias and variability across raters. While probabilistic multi-rater approaches can emulate annotator-specific delineations, annotator characteristics are typically encoded implicitly in a deep latent feature space, making direct analysis of their influence on predictive distributions less straightforward. We propose a logit-space probabilistic segmentation framework based on a stochastic variational Gaussian Process that explicitly decomposes predictions into an image-dependent reference logit distribution and annotator-specific perturbations parameterised by bias and variance. This formulation enables more explicit analysis of how intra- and inter-rater variability propagate to predictive distributions. We evaluate the method on a multi-annotator medical image dataset. The results show that explicitly modelling annotator-specific perturbations improves uncertainty calibration while maintaining comparable segmentation accuracy, compared with a state-of-the-art multi-rater probabilistic segmentation method. The learned bias and variance parameters quantitatively reflect annotator-specific behaviour. Furthermore, controlled perturbation experiments over bias and variance demonstrate how changes in annotator parameters systematically influence predictive performance. The code used in this paper is publicly available at https://github.com/QiLi111/GPS-Var.
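
To make the decomposition described in the abstract concrete, the following is a minimal NumPy sketch. It is not the authors' implementation (the linked repository holds that). It assumes, for illustration only, a binary segmentation task, independent per-pixel Gaussian reference logits, an additive Gaussian annotator perturbation with a scalar bias and variance, and a sigmoid link. The function name `annotator_probs` and all parameter names are hypothetical.

```python
import numpy as np


def annotator_probs(ref_mean, ref_var, bias, ann_var, n_samples=2000, seed=0):
    """Monte Carlo estimate of per-pixel foreground probability for one annotator.

    Assumed toy model (illustrative, not taken from the paper):
        logit = f + e,  f ~ N(ref_mean, ref_var),  e ~ N(bias, ann_var)
    with independent pixels and a sigmoid link.
    """
    rng = np.random.default_rng(seed)
    ref_mean = np.asarray(ref_mean, dtype=float)
    ref_var = np.asarray(ref_var, dtype=float)
    # The sum of independent Gaussians is Gaussian: means add, variances add.
    total_mean = ref_mean + bias
    total_std = np.sqrt(ref_var + ann_var)
    noise = rng.standard_normal((n_samples,) + ref_mean.shape)
    logits = total_mean + total_std * noise
    # Average the sigmoid over logit samples to get predictive probabilities.
    return (1.0 / (1.0 + np.exp(-logits))).mean(axis=0)


# Toy 2x2 "image": reference logit mean and variance per pixel (made-up values).
ref_mean = np.array([[2.0, -2.0], [0.5, -0.5]])
ref_var = np.full((2, 2), 0.1)

p_base = annotator_probs(ref_mean, ref_var, bias=0.0, ann_var=0.0)
p_over = annotator_probs(ref_mean, ref_var, bias=1.0, ann_var=0.0)   # over-segmenting rater
p_noisy = annotator_probs(ref_mean, ref_var, bias=0.0, ann_var=9.0)  # inconsistent rater
```

Under these assumptions, a positive bias raises the foreground probability at every pixel, while a larger annotator variance pulls confident predictions towards 0.5. This is the kind of parameter-level effect that the controlled perturbation experiments in the abstract examine, although the paper's actual parameterisation may differ.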