LLM-Based Visual Explanation Evaluation Framework for Assessing the Explainability of Facial Skin Disease Classification Models

2026-06-15

Computer Vision and Pattern Recognition
AI summary

The authors built a framework to check whether AI explanations for facial skin disease images focus on the clinically relevant lesion areas. They applied geometric, color-based, and mixed image augmentation to several skin disease classification models (EfficientNet-B0, MobileNetV3, and ResNet18), then used Grad-CAM heatmaps to show which image regions drove each prediction. To judge these explanations, they had large language models score how well each heatmap matched lesion areas and how trustworthy it was. They also developed a step-by-step prompting method that adds evaluation rubrics, clinical knowledge, penalty rules, and a structured output format to make the evaluation more consistent and clinically grounded.

LLM, Grad-CAM, facial skin disease diagnosis, EfficientNet-B0, MobileNetV3, ResNet18, data augmentation, visual explanation, prompt engineering, explanation evaluation
Authors
Gyuyeon Na
Abstract
This study proposes a domain-specific LLM-based Visual Explanation Evaluation Framework for assessing Grad-CAM explanations in facial skin disease diagnosis models. While previous studies have primarily focused on improving classification performance through data augmentation techniques, relatively few studies have systematically examined whether model explanations are grounded in clinically relevant lesion regions. In this study, geometric augmentation, color-based augmentation, and mixed augmentation strategies were applied to facial skin disease classification models based on EfficientNet-B0, MobileNetV3, and ResNet18. Grad-CAM was employed to generate visual explanations representing the models' decision-making processes. Furthermore, an LLM-as-a-Judge evaluation framework was designed using GPT-5.5, Gemini 3.5 Flash, and Claude Sonnet 4.6 to assess Grad-CAM explanations from the perspectives of lesion localization and explanation trustworthiness. To improve evaluation consistency and clinical grounding, a progressive prompt engineering strategy was introduced, incorporating evaluation rubrics, clinical knowledge, penalty rules, and structured output formats.
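
The progressive prompt strategy described above could be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the stage wording, the 1 to 5 score scale, the JSON field names, and the helpers `build_prompt` and `parse_judgement` are all assumptions introduced here. The sketch only assembles prompt text and validates a reply; it does not call any LLM API.

```python
import json

# Hypothetical prompt stages, in the order the abstract lists them.
# The wording of each stage is placeholder text, not the paper's actual prompts.
PROMPT_STAGES = [
    ("base", "You are evaluating a Grad-CAM heatmap overlaid on a facial skin image."),
    ("rubric", "Score lesion_localization and trustworthiness from 1 (poor) to 5 (excellent)."),
    ("clinical_knowledge", "Consider where lesions of the predicted disease typically appear on the face."),
    ("penalty_rules", "Lower the scores if the heatmap concentrates on background, hair, or image borders."),
    ("output_format", 'Reply with JSON only: {"lesion_localization": int, "trustworthiness": int, "rationale": str}'),
]

SCORE_KEYS = ("lesion_localization", "trustworthiness")  # assumed field names


def build_prompt(level: int) -> str:
    """Return the prompt made of the first `level` stages (1 = base instruction only)."""
    if not 1 <= level <= len(PROMPT_STAGES):
        raise ValueError(f"level must be between 1 and {len(PROMPT_STAGES)}")
    return "\n".join(text for _, text in PROMPT_STAGES[:level])


def parse_judgement(reply: str) -> dict:
    """Validate a judge reply against the assumed structured output format."""
    data = json.loads(reply)
    result = {}
    for key in SCORE_KEYS:
        value = data[key]
        # Accept only plain integers on the assumed 1-5 scale (bool is rejected).
        if type(value) is not int or not 1 <= value <= 5:
            raise ValueError(f"{key} must be an integer from 1 to 5")
        result[key] = value
    result["rationale"] = str(data.get("rationale", ""))
    return result


if __name__ == "__main__":
    # Each level adds one more component on top of the previous prompt.
    for level in range(1, len(PROMPT_STAGES) + 1):
        print(f"--- level {level} ({PROMPT_STAGES[level - 1][0]}) ---")
        print(build_prompt(level))

    # A made-up reply, used only to exercise the parser.
    sample = '{"lesion_localization": 4, "trustworthiness": 3, "rationale": "Heatmap covers the cheek lesion."}'
    print(parse_judgement(sample))
```

Building the prompt cumulatively makes it possible to send the same Grad-CAM image to a judge model at each level and compare how scores change as rubrics, clinical knowledge, and penalty rules are added. Rejecting malformed replies keeps the collected scores comparable across judge models.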