LLM-Based Data Generation and Clinical Skills Evaluation for Low-Resource French OSCEs
2026-04-09 • Computation and Language
AI summary
The authors studied how large language models can help train French medical students by automatically generating and scoring practice doctor-patient interviews. Because annotated French training data is very scarce, they built a pipeline that generates synthetic interviews reflecting different skill levels and uses language models to label and evaluate them. They benchmarked several models and found that mid-size ones (up to 32B parameters) score these synthetic interviews about as accurately as GPT-4o. This approach could give students more practice and feedback with fewer human examiners, and because such models can run locally, it could also keep data private.
OSCE, Natural Language Processing, Large Language Models, French medical education, synthetic data generation, automatic evaluation, clinical communication skills, privacy-preserving AI, benchmarking, language model parameters
Authors
Tian Huang, Tom Bourgeade, Irina Illina
Abstract
Objective Structured Clinical Examinations (OSCEs) are the standard method for assessing medical students' clinical and communication skills through structured patient interviews. In France, however, the organization of training sessions is limited by human and logistical constraints, restricting students' access to repeated practice and structured feedback. Recent advances in Natural Language Processing (NLP) and Large Language Models (LLMs) now offer the opportunity to automatically evaluate such medical interviews, thereby alleviating the need for human examiners during training. Yet, real French OSCE annotated transcripts remain extremely scarce, limiting reproducible research and reliable benchmarking. To address these challenges, we investigate the use of LLMs for both generating and evaluating French OSCE dialogues in a low-resource context. We introduce a controlled pipeline that produces synthetic doctor-patient interview transcripts guided by scenario-specific evaluation criteria, combining ideal and perturbed performances to simulate varying student skill levels. The resulting dialogues are automatically silver-labeled through an LLM-assisted framework supporting adjustable evaluation strictness. Benchmarking multiple open-source and proprietary LLMs shows that mid-size models ($\le$32B parameters) achieve accuracies comparable to GPT-4o ($\sim$90\%) on synthetic data, highlighting the feasibility of locally deployable, privacy-preserving evaluation systems for medical education.
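The idea of combining ideal and perturbed performances can be illustrated with a minimal sketch. It is not the authors' pipeline: the criteria names, the `skill` parameter, and both helper functions are hypothetical, and the LLM calls that would write and judge the dialogue are omitted. The sketch only shows how choosing which criteria a simulated student meets yields reference labels by construction, against which an evaluator's per-criterion predictions can be scored for accuracy.

```python
import random

def sample_performance(criteria, skill, seed=0):
    """Choose which criteria a simulated student satisfies.

    skill=1.0 models the ideal performance (all criteria met);
    lower values perturb it by dropping a share of the criteria.
    Hypothetical helper, not taken from the paper.
    """
    rng = random.Random(seed)
    n_keep = round(skill * len(criteria))
    kept = set(rng.sample(criteria, n_keep))
    return {c: (c in kept) for c in criteria}

def accuracy(predicted, reference):
    """Fraction of criteria on which two label sets agree."""
    hits = sum(predicted[c] == reference[c] for c in reference)
    return hits / len(reference)

# Hypothetical scenario-specific criteria, for illustration only.
criteria = [
    "introduces self",
    "asks about pain onset",
    "asks about medication",
    "summarizes findings",
]

ideal = sample_performance(criteria, skill=1.0)
weak = sample_performance(criteria, skill=0.5)

# An evaluator that marks every criterion as met is right on the
# ideal student but only half right on the perturbed one.
always_met = {c: True for c in criteria}
print(accuracy(always_met, ideal), accuracy(always_met, weak))
```

In a full system, the sampled labels would condition the dialogue generator, and the evaluator's predictions would come from an LLM judge rather than a fixed dictionary.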