Real-Time Multimodal Activity-Aware Error Detection in Robot-Assisted Surgery

2026-06-22 • Robotics

Robotics • Computer Vision and Pattern Recognition

AI summary

The authors developed a system that detects executional errors during robot-assisted surgery by combining video, kinematic (motion) data, and curated text descriptions. The text prompts describe the surgical gestures, the instrument-object interactions, and the error types. The method compares contrastive language-image embeddings with traditional image-based embeddings and fuses the visual features with motion and text information. In the reported tests, the approach improved F1 score over state-of-the-art baselines by up to 5% on the JIGSAWS dataset and up to 16.6% on the SAR-RARP50 dataset.

robot-assisted surgery, executional error detection, multimodal input, video analysis, kinematic data, textual prompts, contrastive language-image embeddings, surgical activity labels, F1 score, JIGSAWS dataset

Authors

Seyed Hamid Reza Roodabeh, Zongyu Li, Homa Alemzadeh

Abstract

Robot-assisted minimally invasive surgery improves surgical precision but introduces complexity, making technical error detection essential for ensuring patient safety. Current executional error detection methods using video data often overlook fine-grained contextual descriptions of activities and error types within the hierarchical structure of surgical procedures. They also under-utilize complementary multimodal information. We propose a unified framework for executional error detection that leverages multimodal input, including video, kinematics, and descriptive textual prompts. Through activity prompting, we integrate descriptive language covering gesture-level activities, instrument-object interactions, and error definitions. We also introduce activity-aware visual embeddings derived from vision encoders pretrained on surgical activity labels to compare the effectiveness of contrastive language-image embeddings with traditional image-based embeddings for error detection. By integrating kinematic data with the video and textual modalities, our framework significantly improves error detection performance. It achieves F1 score improvements of up to 5% and 16.6% over state-of-the-art baselines on the JIGSAWS and SAR-RARP50 datasets, respectively, demonstrating the value of combining curated textual prompts with multimodal data for accurate error detection.
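
The abstract reports its gains in F1 score. As a reminder of what that metric measures, here is a minimal sketch of binary F1 for error detection. The label arrays, the names `baseline` and `multimodal`, and the choice of a binary per-sample F1 are all illustrative assumptions; the abstract does not state the averaging scheme, the evaluation granularity, or whether the reported improvements are absolute or relative.

```python
def f1_score(y_true, y_pred):
    """Binary F1 where 1 = erroneous sample and 0 = normal sample."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # errors correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed errors
    if tp == 0:
        # No true positives: precision and recall are 0 or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels, NOT data from the paper.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
baseline = [1, 0, 0, 0, 1, 0, 1, 0]    # made-up predictions of a weaker model
multimodal = [1, 1, 0, 0, 1, 0, 1, 0]  # made-up predictions of a stronger model

f1_baseline = f1_score(y_true, baseline)
f1_multimodal = f1_score(y_true, multimodal)
print(f"baseline F1:   {f1_baseline:.3f}")
print(f"multimodal F1: {f1_multimodal:.3f}")
```

Because F1 balances precision and recall, a detector that recovers one more true error without adding false alarms raises the score, which is the kind of difference the reported comparison against baselines captures.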
