Overview of the ClinicalSkillQA 2026 Shared Task on Continuous Perception and Procedural Reasoning in Clinical Skill Assessment
2026-06-01 • Human-Computer Interaction
AI summary
The authors organized a competition to test how well computer systems can understand the order of steps in emergency medical procedures by looking at shuffled video frames and explaining their reasoning. They created a benchmark with 200 examples from clinical skill videos, each with the correct step order and expert explanations. Seven teams competed, and their systems were judged on how accurately they could reorder the steps and generate good explanations. The paper describes the task, data, and results, finding that current AI models still find it hard to combine visual, timing, and medical knowledge for this kind of reasoning.
clinical skill assessment, procedural reasoning, temporal order reconstruction, clinical workflow, continuous perception, emergency-care procedures, benchmark dataset, BERTScore, task accuracy, BioNLP Workshop
Authors
Xiyang Huang, Renxiong Wei, Yihuai Xu, Zhiyuan Chen, Keying Wu, Jiayi Xiang, Buzhou Tang, Yanqing Ye, Jinyu Chen, Cheng Zeng, Min Peng, Qianqian Xie, Sophia Ananiadou
Abstract
This paper presents an overview of the ClinicalSkillQA 2026 shared task, which was organized with the BioNLP Workshop at ACL 2026. The goal of this shared task is to evaluate continuous perception and procedural reasoning in clinical skill assessment by requiring systems to reconstruct the correct temporal order of shuffled clinical key frames and generate rationales grounded in clinical workflow knowledge. The benchmark contains 200 test-only instances sampled from clinical skill videos, covering three emergency-care procedures. Each instance is annotated with the ground-truth temporal order and an expert-verified rationale. A total of seven teams participated in the task, collectively making 90 submissions, with four teams providing system description papers. Systems are evaluated using Task Accuracy, Pairwise Accuracy, and BERTScore, which measure exact sequence reconstruction, local temporal consistency, and rationale quality, respectively. In this paper, we describe the task setup, dataset construction, and evaluation criteria. We further summarize the methodologies adopted by participating teams and present a comprehensive analysis of the submitted systems. The official results suggest that current models still struggle with continuous perception and procedural reasoning, especially when they must integrate visual evidence, temporal structure, and clinical workflow knowledge.
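To make the two ordering metrics concrete, the following is a minimal sketch of how Task Accuracy and Pairwise Accuracy could be computed. It is not the official scoring script. The function names, the use of frame labels such as "A" to "D", and the choice to count every pair of frames (rather than, say, only adjacent pairs) are assumptions made for illustration; the abstract states only that the metrics measure exact sequence reconstruction and local temporal consistency.

```python
from itertools import combinations
from typing import Hashable, Sequence


def task_accuracy(
    predictions: Sequence[Sequence[Hashable]],
    references: Sequence[Sequence[Hashable]],
) -> float:
    """Fraction of instances whose predicted order equals the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    exact = sum(list(p) == list(r) for p, r in zip(predictions, references))
    return exact / len(references)


def pairwise_accuracy(
    prediction: Sequence[Hashable],
    reference: Sequence[Hashable],
) -> float:
    """Fraction of frame pairs whose relative order matches the reference.

    Assumption: all pairs (i < j) in the reference are counted; the official
    metric may use a different pair set.
    """
    position = {frame: idx for idx, frame in enumerate(prediction)}
    # combinations() preserves reference order, so `a` precedes `b` in the gold sequence.
    pairs = list(combinations(reference, 2))
    if not pairs:
        return 1.0
    correct = sum(
        1
        for a, b in pairs
        if a in position and b in position and position[a] < position[b]
    )
    return correct / len(pairs)


# Hypothetical example: four key frames, with the middle two swapped.
gold = ["A", "B", "C", "D"]
pred = ["A", "C", "B", "D"]
print(task_accuracy([pred], [gold]))
print(pairwise_accuracy(pred, gold))
```

In this hypothetical example a single swap of two neighbouring frames makes the instance count as wrong under Task Accuracy, while Pairwise Accuracy still credits the five of six frame pairs that remain in the correct relative order. This is why the two metrics are reported together.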