AI summary
The authors studied how to make large language models (LLMs) better at understanding medical incident reports by choosing examples that help the model learn. They tested three ways to pick these examples: random choice, picking based on text similarity, and their new method using tags that describe the reports. Using real Japanese medical incident data, they found that the tag-based method made the models give more accurate and consistent answers about causes and prevention. The tag-based approach worked better than similarity-based selection, which sometimes caused errors or safety problems. This suggests that choosing examples based on clear, human-readable tags can make medical AI tools safer and more reliable.
Keywords: Large Language Models, Few-shot learning, Example selection, Medical incident reports, Tag-based selection, Cosine similarity, GPT-4o, LLaMA 3.3, Preventive measures, Clinical AI safety
Authors
Yuna Haseyama, Tomoki Ito, Hiroki Sakaji, Itsuki Noda
Abstract
In high-stakes domains such as healthcare, the reliability of Large Language Models (LLMs) is critical, particularly when generating clinical insights from incident reports. This study proposes a tag-based few-shot example selection method for prompting LLMs to generate background/causal factors and preventive measures from the details of medical incidents. For our experiments, we use the Japanese Medical Incident Dataset (JMID), a structured dataset of 3,884 real-world medical accident and near-miss reports. These reports are variably annotated with a wide range of tags, some of which carry descriptive information (e.g., "medications," "blood transfusion therapy"). We compare three few-shot example selection strategies--random sampling, cosine similarity-based selection, and our proposed tag-based method--using GPT-4o and LLaMA 3.3. Results show that the tag-based approach achieves the highest precision and the most stable generation behavior, while similarity-based selection often leads to unintended outputs and safety filter activation. These findings suggest that selecting examples based on human-interpretable dataset tags can improve generation precision and stability in clinical LLM applications.
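To make the two non-random selection strategies concrete, the following is a minimal, hypothetical Python sketch, not the authors' implementation: tag-based selection ranks candidate reports by the number of annotation tags they share with the query, while similarity-based selection ranks them by cosine similarity over simple word-count vectors. All report records, field names, and scoring choices below are illustrative assumptions.

```python
# Illustrative sketch of few-shot example selection (hypothetical data/fields).
from collections import Counter
import math

def tag_based_select(query_tags, pool, k=2):
    """Rank candidate reports by the number of shared tags (assumed scoring)."""
    scored = sorted(pool, key=lambda r: len(query_tags & r["tags"]), reverse=True)
    return scored[:k]

def cosine_select(query_text, pool, k=2):
    """Rank candidate reports by cosine similarity of word-count vectors."""
    def vec(text):
        return Counter(text.lower().split())

    def cos(a, b):
        num = sum(a[w] * b[w] for w in set(a) & set(b))
        den = math.sqrt(sum(v * v for v in a.values())) \
            * math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    q = vec(query_text)
    scored = sorted(pool, key=lambda r: cos(q, vec(r["text"])), reverse=True)
    return scored[:k]

# Toy candidate pool (invented examples, not JMID records).
pool = [
    {"id": 1, "tags": {"medications", "dosage"},
     "text": "wrong dose of medication administered"},
    {"id": 2, "tags": {"blood transfusion therapy"},
     "text": "transfusion delayed due to labeling error"},
    {"id": 3, "tags": {"medications"},
     "text": "medication given to wrong patient"},
]

examples = tag_based_select({"medications"}, pool, k=2)
print([r["id"] for r in examples])  # → [1, 3]
```

Because the tags are human-readable, a clinician can inspect why a given report was chosen as a few-shot example, which the abstract argues contributes to safer and more stable generations than opaque embedding similarity.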