Cross-modal linkage risk in clinical vision-language models

2026-06-01 | Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning
AI summary

The authors studied vision-language models (VLMs) that link chest X-ray images with their written radiology reports and identified a privacy risk: an anonymized X-ray could be matched back to its original report. They measured this risk on large public datasets and found that more clinically specialized models matched images to reports at rates far above random chance. To reduce the risk, they applied differential privacy only to the part of the model that aligns images with text, leaving the image and text encoders frozen. This substantially lowered the chance of re-linking images to reports while largely preserving the model's usefulness for image-based diagnosis.

Vision-language models, Chest radiographs, Radiology reports, Privacy risk, Image-to-report retrieval, Differential privacy, Recall@1, MIMIC-CXR, CheXpert Plus, Linear-probe classification
Authors
Soroosh Tayebi Arasteh, Mahshad Lotfinia, Sven Nebelung, Daniel Truhn
Abstract
Vision-language models (VLMs) trained on paired chest radiographs and radiology reports learn a shared embedding space that can preserve instance-level image-report correspondence. This poses a privacy risk in settings where radiographs and reports are deliberately kept separate after acquisition, such as image-only data sharing or access-controlled reports, because a de-identified image may be re-linked to its original narrative report through cosine similarity alone. We formalized this as image-to-report retrieval and used public paired cohorts, in which the true pairing is known by design, as ground-truth benchmarks to audit the risk rather than as the privacy scenario itself. Evaluating VLMs of increasing clinical specialization on 406,241 paired examples from 126,804 patients across MIMIC-CXR (43,793 held-out pairs) and external CheXpert Plus (29,296 pairs), we found that re-linkage rose systematically with specialization: the strongest VLM retrieved the correct report at 15 times chance at a candidate pool of N = 100, 50 times chance at N = 10,000, and well above chance at full-database scale. The signal persisted under pathology-matched hard negatives that removed disease-label shortcuts, indicating correspondence beyond broad diagnostic categories. To reduce this re-linkage without retraining the encoders, we froze both encoders and applied differentially private optimization only to the projection heads defining the alignment layer (epsilon = 0.34, delta = 6 × 10⁻⁶). This reduced Recall@1 by 61.8% at N = 10,000 on MIMIC-CXR and transferred to CheXpert Plus without retraining, while image-side utility was largely preserved: macro AUROC for linear-probe classification across 14 labels shifted only from 79.63% to 79.43%. Targeted DP fine-tuning of the shared alignment layer can therefore substantially reduce cross-modal re-linkage without materially degrading the image representations that make these models clinically useful.
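The two quantities the abstract relies on, Recall@1 under cosine similarity and a differentially private update, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (recall_at_1, times_chance, dp_noisy_mean_grad), the synthetic embeddings, and the clip-and-add-Gaussian-noise recipe (the usual DP-SGD step) are assumptions made for illustration, since the abstract does not say which DP optimizer or hyperparameters were used. The sketch also assumes that chance-level Recall@1 in a pool of N candidates is 1/N, so "15 times chance at N = 100" would correspond to a Recall@1 of about 0.15.

```python
import numpy as np


def recall_at_1(image_emb: np.ndarray, report_emb: np.ndarray) -> float:
    """Fraction of images whose most cosine-similar report is the true pair.

    Assumes row i of image_emb is paired with row i of report_emb.
    """
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    rep = report_emb / np.linalg.norm(report_emb, axis=1, keepdims=True)
    sim = img @ rep.T                      # (N, N) cosine similarities
    top1 = sim.argmax(axis=1)              # best-matching report per image
    return float((top1 == np.arange(len(img))).mean())


def times_chance(recall: float, pool_size: int) -> float:
    """Recall@1 as a multiple of chance, assuming chance is 1 / pool_size."""
    return recall * pool_size


def dp_noisy_mean_grad(per_example_grads, clip_norm, noise_multiplier, rng):
    """One DP-SGD-style gradient estimate (hypothetical helper).

    Each per-example gradient (one row) is clipped to L2 norm <= clip_norm,
    the clipped gradients are summed, Gaussian noise with standard deviation
    noise_multiplier * clip_norm is added, and the result is averaged.
    """
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=clipped.shape[1])
    return (clipped.sum(axis=0) + noise) / len(clipped)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 100, 64
    shared = rng.normal(size=(n, d))                  # synthetic "pair identity"
    images = shared + 0.5 * rng.normal(size=(n, d))   # synthetic image embeddings
    reports = shared + 0.5 * rng.normal(size=(n, d))  # synthetic report embeddings
    r1 = recall_at_1(images, reports)
    print(f"Recall@1 = {r1:.3f}, {times_chance(r1, n):.1f}x chance at N = {n}")

    grads = rng.normal(size=(32, d))                  # stand-in per-example grads
    update = dp_noisy_mean_grad(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
    print("noisy gradient norm:", float(np.linalg.norm(update)))
```

In a setup like the one described, the per-example gradients would come only from the projection heads, with both encoders frozen. The privacy budget (epsilon, delta) would be tracked by a separate accountant that this sketch does not include.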