Fine-Grained Action Segmentation for Renorrhaphy in Robot-Assisted Partial Nephrectomy
2026-04-10 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Robotics
AI summary
The authors studied how to recognize very detailed sewing actions during robot-assisted kidney surgery using video data. They created a benchmark called SIA-RAPN with 50 surgery videos labeled frame-by-frame into 12 action types. They tested four computer models to see which best identifies these actions, scoring them on both per-frame correctness and how well whole action segments were recovered. DiffAct performed best on most metrics, while MS-TCN++ had the highest balanced accuracy. They also tested how well these models worked on a different but related surgery dataset.
renorrhaphy, robot-assisted partial nephrectomy, fine-grained action segmentation, temporal models, I3D features, MS-TCN++, AsFormer, DiffAct, balanced accuracy, segmental F1 score
Authors
Jiaheng Dai, Huanrong Liu, Tailai Zhou, Tongyu Jia, Qin Liu, Yutong Ban, Zeju Li, Yu Gao, Xin Ma, Qingbiao Li
Abstract
Fine-grained action segmentation during renorrhaphy in robot-assisted partial nephrectomy requires frame-level recognition of visually similar suturing gestures with variable duration and substantial class imbalance. The SIA-RAPN benchmark defines this problem on 50 clinical videos acquired with the da Vinci Xi system and annotated with 12 frame-level labels. The benchmark compares four temporal models built on I3D features: MS-TCN++, AsFormer, TUT, and DiffAct. Evaluation uses balanced accuracy, edit score, segmental F1 at overlap thresholds of 10%, 25%, and 50%, frame-wise accuracy, and frame-wise mean average precision. In addition to the primary evaluation across five released split configurations on SIA-RAPN, the benchmark reports cross-domain results on a separate single-port RAPN dataset. Across the strongest reported values over those five runs on the primary dataset, DiffAct achieves the highest F1, frame-wise accuracy, edit score, and frame mAP, while MS-TCN++ attains the highest balanced accuracy.
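The abstract names its metrics without defining them. As a minimal sketch, the three segment-aware and imbalance-aware metrics can be computed as below, assuming the conventions commonly used in the action-segmentation literature: balanced accuracy as the mean per-class recall, edit score as a normalized Levenshtein distance over segment label sequences, and segmental F1 with an IoU threshold. The function names and the toy label sequences are hypothetical and are not taken from the SIA-RAPN code or data; scores are returned in the range 0 to 1.

```python
from collections import defaultdict


def segments(labels):
    """Collapse frame labels into (label, start, end) runs; end is exclusive."""
    runs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append((labels[start], start, i))
            start = i
    return runs


def balanced_accuracy(pred, gt):
    """Mean per-class recall over the classes present in the ground truth."""
    hits, totals = defaultdict(int), defaultdict(int)
    for p, g in zip(pred, gt):
        totals[g] += 1
        hits[g] += int(p == g)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def edit_score(pred, gt):
    """1 minus the normalized Levenshtein distance between segment label sequences."""
    p = [s[0] for s in segments(pred)]
    g = [s[0] for s in segments(gt)]
    prev = list(range(len(g) + 1))
    for i in range(1, len(p) + 1):
        cur = [i] + [0] * len(g)
        for j in range(1, len(g) + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return 1.0 - prev[-1] / max(len(p), len(g))


def segmental_f1(pred, gt, threshold):
    """F1 over segments: a predicted segment is a true positive if its best
    same-label ground-truth segment has IoU >= threshold and is not yet matched."""
    p_segs, g_segs = segments(pred), segments(gt)
    used = [False] * len(g_segs)
    tp = 0
    for label, ps, pe in p_segs:
        best_iou, best_j = 0.0, -1
        for j, (gl, gs, ge) in enumerate(g_segs):
            if gl != label:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= threshold and not used[best_j]:
            tp += 1
            used[best_j] = True
    fp = len(p_segs) - tp
    fn = len(g_segs) - tp
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Toy example with three hypothetical gesture classes (0, 1, 2).
gt = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
pred = [0, 0, 0, 1, 1, 1, 1, 1, 2, 0]
print(balanced_accuracy(pred, gt))
print(edit_score(pred, gt))
print([segmental_f1(pred, gt, t) for t in (0.10, 0.25, 0.50)])
```

In this toy case the single spurious trailing segment lowers the edit score and segmental F1 even though only one frame is affected, which is why segment-level metrics are reported alongside frame-wise ones. Balanced accuracy weights each class equally, so it can favor a different model than frame-wise accuracy when classes are imbalanced.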