SHOVIR: A Benchmark for Evaluating Vision Shortcut Learning in Radiology Report Generation

2026-06-29

Computer Vision and Pattern Recognition, Computation and Language
AI summary

The authors found that current methods for evaluating AI-generated radiology reports mostly measure overall correctness but do not check whether the AI actually relies on the relevant image details. They created SHOVIR, a benchmark that occludes regions of chest X-ray images to test whether the AI still reports findings whose visual evidence has been removed. Testing eight advanced models, they found that some models produce good reports without properly relying on the image evidence. This shows that existing evaluation can miss important flaws, and suggests that future evaluation should focus more on which image regions the AI uses.

Vision-Language Models, Radiology Report Generation, Chest X-ray, CheXpert labels, Image occlusion, Spatial grounding, Diagnostic statements, MIMIC-CXR, PadChest-GR, Evaluation protocols
Authors
Filippo Ruffini, Marco Salmé, Rosa Sicilia, Valerio Guarrasi, Paolo Soda
Abstract
Current evaluation protocols for Vision-Language Models (VLMs) in Radiology Report Generation (RRG) rely on report-level metrics that measure lexical overlap or aggregate clinical correctness. However, such metrics do not test whether individual diagnostic statements stem from the actual pathological evidence visible in the image. This allows models to achieve competitive scores by exploiting learned priors or spurious correlations, a failure mode we refer to as vision shortcut. We introduce SHOVIR, a benchmark for evaluating vision shortcut behavior in RRG. SHOVIR extends two spatially annotated chest X-ray datasets, MIMIC-CXR and PadChest-GR, with per-box CheXpert labels, and defines image-level and disease-level occlusion experiments that contrast baseline performance on clean images against localized, region-specific perturbations. Comparing predictions across these conditions isolates two failure modes at the disease-class level: direct shortcuts, where a finding persists after its visual evidence is removed, and contextual shortcuts, where detection degrades once co-occurring pathologies are occluded despite the target region remaining intact. Benchmarking eight state-of-the-art VLMs, we find that shortcut behavior varies substantially across architectures and datasets. Models achieving the highest baseline report quality do not necessarily rank highest in spatial grounding, revealing that clinically fluent generation can coexist with shallow reliance on visual evidence. These findings expose a blind spot in current RRG evaluation and motivate region-aware assessment protocols.
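
The two failure modes described in the abstract can be illustrated with a minimal Python sketch. This is not the benchmark's implementation: the function names, the constant-fill occlusion, and the two rate definitions below are assumptions made for illustration, and the abstract does not specify how SHOVIR perturbs regions or scores shortcuts. The sketch assumes that, for one disease class, binary predictions are available under three conditions: the clean image, the image with the finding's own box occluded, and the image with only co-occurring findings occluded.

```python
import numpy as np


def occlude_box(image, box, fill=0):
    """Return a copy of `image` with the box (x0, y0, x1, y1) set to `fill`.

    Constant fill is an assumed perturbation, chosen only for simplicity.
    """
    x0, y0, x1, y1 = box
    out = image.copy()
    out[y0:y1, x0:x1] = fill
    return out


def direct_shortcut_rate(baseline, target_occluded):
    """Share of baseline-positive cases still predicted positive after the
    finding's own region is occluded (hypothetical definition)."""
    baseline = np.asarray(baseline, dtype=bool)
    target_occluded = np.asarray(target_occluded, dtype=bool)
    n_pos = baseline.sum()
    if n_pos == 0:
        return float("nan")
    return float((baseline & target_occluded).sum() / n_pos)


def contextual_shortcut_rate(baseline, context_occluded):
    """Share of baseline-positive cases lost when only co-occurring regions
    are occluded and the target region is intact (hypothetical definition)."""
    baseline = np.asarray(baseline, dtype=bool)
    context_occluded = np.asarray(context_occluded, dtype=bool)
    n_pos = baseline.sum()
    if n_pos == 0:
        return float("nan")
    return float((baseline & ~context_occluded).sum() / n_pos)


# Toy predictions for one disease class over five images (made-up values).
baseline = [1, 1, 1, 1, 0]
target_occluded = [1, 0, 1, 0, 0]   # finding's own box masked
context_occluded = [1, 1, 1, 0, 0]  # only co-occurring boxes masked

direct = direct_shortcut_rate(baseline, target_occluded)
contextual = contextual_shortcut_rate(baseline, context_occluded)
```

Under these assumed definitions, a high direct rate means the model keeps reporting a finding whose evidence is gone, while a high contextual rate means detection depends on neighbouring pathologies rather than on the target region itself.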