Does Language Shift Break Medical Vision-Language Models? Indonesian Radiology Visual Question Answering Case Study

2026-06-02 • Computation and Language

Computation and Language, Computer Vision and Pattern Recognition

AI summary

The authors created IndoRad-VQA, an Indonesian translation of the VQA-RAD medical question-answering dataset, to test whether medical vision-language models (VLMs) handle Indonesian clinical questions as well as English ones. They applied quality control so that the translations kept clinical meaning and medical terms accurate, then compared model performance in both languages. Their tests showed a performance gap of 8 to 25 percent between the English and Indonesian settings, depending on the evaluation metric. The authors conclude that evaluating these models only in English is not enough and that medical AI tools need broader multilingual testing.

Medical Vision-Language Models, Visual Question Answering, Radiology, Multilingual Evaluation, Indonesian Language, Dataset Translation, Model Robustness, Clinical Language, Answer Equivalence, Error Analysis

Authors

Pieter Christy Yan Yudhistira, Dzaki Rafif Malik, Novanto Yudistira

Abstract

Medical Vision-Language Models (VLMs) are typically evaluated on English radiology visual question answering benchmarks, leaving their robustness under non-English clinical language largely unexplored. We introduce IndoRad-VQA, an Indonesian adaptation of VQA-RAD, to assess whether medical VLMs retain radiology reasoning ability when questions are asked in Bahasa Indonesia. Radiology question-answer pairs are translated into Indonesian with self-evaluation-based quality control to preserve clinical meaning, terminology consistency, and answer equivalence. We evaluate general-purpose, Southeast Asian multilingual, and medical-specific VLMs under English and Indonesian prompting settings. Beyond accuracy, we quantify the language robustness gap between English and Indonesian inputs. We also conduct an error analysis to identify failure modes of question answering, such as yes/no flips, laterality errors, and output-language mismatches. Our findings show that strong performance on English medical VQA benchmarks does not necessarily translate to robust behavior in Indonesian clinical contexts. We observe a performance gap of 8 to 25 percent between the English and Indonesian settings, depending on the evaluation metric. These results highlight the need for more inclusive multilingual evaluation of medical multimodal foundation models. The dataset is available at https://huggingface.co/datasets/Lab-IS/IndoRad-VQA.
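The abstract does not give a formula for the language robustness gap or for yes/no flips, so the following is only a minimal sketch under stated assumptions: the gap is taken as English accuracy minus Indonesian accuracy, accuracy is case-insensitive exact match, and a flip is counted when the same model gives opposite yes/no answers to the English and Indonesian versions of a question. All function names and the toy answers are hypothetical and are not taken from the paper or the dataset.

```python
from typing import Optional, Sequence

# Assumed yes/no vocabularies for English and Indonesian answers.
YES_TOKENS = {"yes", "ya"}
NO_TOKENS = {"no", "tidak"}


def normalize_yes_no(answer: str) -> Optional[str]:
    """Map an English or Indonesian yes/no answer to 'yes' or 'no', else None."""
    token = answer.strip().lower().rstrip(".")
    if token in YES_TOKENS:
        return "yes"
    if token in NO_TOKENS:
        return "no"
    return None


def accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Case-insensitive exact-match accuracy (an assumed metric)."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and equal in length")
    correct = sum(p.strip().lower() == g.strip().lower() for p, g in zip(preds, golds))
    return correct / len(golds)


def robustness_gap(acc_en: float, acc_id: float) -> float:
    """Assumed definition: English accuracy minus Indonesian accuracy."""
    return acc_en - acc_id


def yes_no_flip_rate(en_preds: Sequence[str], id_preds: Sequence[str]) -> float:
    """Share of paired yes/no predictions whose polarity differs across languages."""
    pairs = [(normalize_yes_no(e), normalize_yes_no(i)) for e, i in zip(en_preds, id_preds)]
    valid = [(e, i) for e, i in pairs if e is not None and i is not None]
    if not valid:
        return 0.0
    return sum(e != i for e, i in valid) / len(valid)


# Toy, made-up predictions for four closed-ended questions.
en_golds = ["yes", "no", "no", "no"]
id_golds = ["ya", "tidak", "tidak", "tidak"]
en_preds = ["yes", "no", "yes", "no"]
id_preds = ["ya", "ya", "ya", "tidak"]

acc_en = accuracy(en_preds, en_golds)
acc_id = accuracy(id_preds, id_golds)
gap = robustness_gap(acc_en, acc_id)
flip_rate = yes_no_flip_rate(en_preds, id_preds)
```

A positive gap here would mean the model answers more accurately in English than in Indonesian. Whether the paper reports the gap in absolute percentage points or as a relative drop is not stated in the abstract, so the subtraction above is one plausible reading rather than the authors' definition.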
