Uncertainty Is Not a Safety Net for Clinical VQA, but Can It Anticipate Model Failure?

2026-06-15 | Computation and Language

AI summary

The authors studied how well vision-language models (VLMs) used in clinical settings can signal when their answers might be wrong, a capability known as uncertainty estimation (UE). They benchmarked 8 UE methods across 12 VLMs and found that UE quality degrades exactly where model accuracy drops, so it is least helpful when it is needed most. When the correct option was hidden from the multiple-choice answers (NOTA perturbations), accuracy collapsed while uncertainty scores barely changed, leaving the models miscalibrated. However, uncertainty on the original, unperturbed inputs still anticipated which predictions would fail under perturbation, suggesting that UE can help spot fragile predictions. The authors therefore argue for perturbation-based evaluation as a step toward deploying these models safely in healthcare.

vision-language models, uncertainty estimation, clinical VQA, model calibration, diagnostic tool, perturbation testing, model accuracy, NOTA perturbation, prediction reliability, model fragility
Authors
Arnisa Fazla, Alberto Testoni, Ameen Abu-Hanna, Barbara Plank, Iacer Calixto
Abstract
Safe deployment of clinical vision-language models (VLMs) requires reliable uncertainty estimation (UE): a signal indicating when predictions should be trusted or escalated to a clinician. We test whether current UE methods actually deliver this signal. Benchmarking 8 methods across 12 VLMs on clinical visual question-answering (VQA), we find that UE quality is not an intrinsic property of the UE method: it tracks model accuracy, degrading precisely where model performance is weakest, and therefore where reliability is most needed. When we stress-test models by hiding the correct option among the multiple-choice answers (NOTA perturbations), accuracy collapses while uncertainty barely changes, leaving models systematically miscalibrated. Yet we find that uncertainty on the unperturbed input reliably anticipates which predictions will collapse under NOTA, indicating that UE in current VLMs carries diagnostic information about model fragility. Our results position UE as a diagnostic tool for identifying fragile predictions and motivate perturbation-based evaluation as a path toward safe clinical deployment.
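
The two ideas in the abstract, hiding the correct option and checking whether uncertainty on the unperturbed input anticipates failure, can be illustrated with a small sketch. This is not the authors' code. It assumes that "NOTA" means replacing the correct option with a "None of the above" choice, and that anticipation is scored with a rank-based AUROC; the abstract states neither the exact construction nor the metric. The names `make_nota_item`, `auroc`, and `NOTA_TEXT` are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed wording of the replacement option; the abstract does not specify it.
NOTA_TEXT = "None of the above"


def make_nota_item(options: Sequence[str], correct_idx: int) -> Tuple[List[str], int]:
    """Return a copy of `options` with the correct answer replaced by NOTA.

    Assumption: after the perturbation, the NOTA slot is the only correct
    choice, so the index of the correct answer stays the same.
    """
    if not 0 <= correct_idx < len(options):
        raise IndexError("correct_idx out of range")
    perturbed = list(options)  # copy, so the original item is left untouched
    perturbed[correct_idx] = NOTA_TEXT
    return perturbed, correct_idx


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUROC: probability that a randomly chosen positive
    (label 1) has a higher score than a randomly chosen negative (label 0).
    Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

In this setup, `scores` would be the uncertainty values computed on the unperturbed questions, and `labels` would mark which of those predictions turned wrong after `make_nota_item` was applied (1 = failed under NOTA). An AUROC well above 0.5 would indicate that clean-input uncertainty carries information about fragility, in the sense described in the abstract.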