{"title":"Temporal Surgical Gesture Segmentation and Classification in Multi-gesture Robotic Surgery using Fine-tuned features and Calibrated MS-TCN","authors":"Snigdha Agarwal, Chakka Sai Pradeep, N. Sinha","doi":"10.1109/SPCOM55316.2022.9840779","DOIUrl":null,"url":null,"abstract":"Temporal Gesture Segmentation is an active research problem for many applications such as surgical skill assessment, surgery training, robotic training. In this paper, we propose a novel method for Gesture Segmentation on untrimmed surgical videos of the challenging JIGSAWS dataset by using a two-step methodology. We train and evaluate our method on 39 videos of the Suturing task which has 10 gestures. The length of gestures ranges from 1 second to 75 seconds and full video length varies from 1 minute to 5 minutes. In step one, we extract encoded frame-wise spatio-temporal features on full temporal resolution of the untrimmed videos. In step two, we use these extracted features to identify gesture segments for temporal segmentation and classification. To extract high-quality features from the surgical videos, we also pre-train gesture classification models using transfer learning on the JIGSAWS dataset using two state-of-the-art pretrained backbone architectures. For segmentation, we propose an improved calibrated MS-TCN (CMS-TCN) by introducing a smoothed focal loss as loss function which helps in regularizing our TCN to avoid making over-confident decisions. We achieve a frame-wise accuracy of 89.8% and an Edit Distance score of 91.5%, an improvement of 2.2% from previous works. We also propose a novel evaluation metric that normalizes the effect of correctly classifying the frames of larger segments versus smaller segments in a single score.","PeriodicalId":246982,"journal":{"name":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SPCOM55316.2022.9840779","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
Temporal gesture segmentation is an active research problem with applications in surgical skill assessment, surgical training, and robot-assisted surgery. In this paper, we propose a novel two-step method for gesture segmentation on untrimmed surgical videos from the challenging JIGSAWS dataset. We train and evaluate our method on 39 videos of the Suturing task, which comprises 10 gestures. Gesture lengths range from 1 to 75 seconds, and full video lengths vary from 1 to 5 minutes. In step one, we extract encoded frame-wise spatio-temporal features at the full temporal resolution of the untrimmed videos. In step two, we use these extracted features to identify gesture segments for temporal segmentation and classification. To extract high-quality features from the surgical videos, we also pre-train gesture classification models on the JIGSAWS dataset via transfer learning from two state-of-the-art pretrained backbone architectures. For segmentation, we propose an improved calibrated MS-TCN (CMS-TCN) that introduces a smoothed focal loss as the loss function, which regularizes the TCN and discourages over-confident predictions. We achieve a frame-wise accuracy of 89.8% and an Edit Distance score of 91.5%, an improvement of 2.2% over previous work. We also propose a novel evaluation metric that balances, in a single score, the effect of correctly classifying frames in larger versus smaller segments.
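The abstract does not give the exact formulation of the smoothed focal loss. A minimal sketch, assuming it combines the standard focal loss modulation with label smoothing over the per-frame gesture classes (the function name and the `gamma` and `smoothing` defaults are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def smoothed_focal_loss(logits, targets, gamma=2.0, smoothing=0.1):
    """Focal loss with label smoothing (hypothetical reconstruction).

    logits:  (N, C) raw per-frame class scores from the TCN
    targets: (N,)   integer gesture labels
    """
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()

    # Smoothed one-hot targets: (1 - smoothing) on the true class,
    # smoothing / (C - 1) spread over the remaining classes.
    with torch.no_grad():
        true_dist = torch.full_like(log_probs, smoothing / (num_classes - 1))
        true_dist.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)

    # Focal modulation down-weights frames the model already classifies
    # confidently, concentrating the gradient on hard frames.
    focal_weight = (1.0 - probs) ** gamma
    loss = -(true_dist * focal_weight * log_probs).sum(dim=-1)
    return loss.mean()
```

With `smoothing=0` this reduces to the standard focal loss, and with `gamma=0` to cross-entropy with label smoothing, which is consistent with the stated goal of calibrating the network away from over-confident decisions.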
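The Edit Distance score reported here is, in the MS-TCN literature, the segmental edit score: the Levenshtein distance between the run-length-encoded predicted and ground-truth gesture sequences, normalized by the longer sequence. A minimal sketch under that assumption (helper names are hypothetical):

```python
def edit_score(pred_labels, gt_labels):
    """Segmental edit score in [0, 100]; higher is better."""
    def segments(labels):
        # Collapse frame-wise labels into a sequence of segment labels.
        return [l for i, l in enumerate(labels) if i == 0 or l != labels[i - 1]]

    p, g = segments(pred_labels), segments(gt_labels)

    # Classic dynamic-programming Levenshtein distance between segment sequences.
    D = [[0] * (len(g) + 1) for _ in range(len(p) + 1)]
    for i in range(len(p) + 1):
        D[i][0] = i
    for j in range(len(g) + 1):
        D[0][j] = j
    for i in range(1, len(p) + 1):
        for j in range(1, len(g) + 1):
            D[i][j] = min(D[i - 1][j] + 1,          # deletion
                          D[i][j - 1] + 1,          # insertion
                          D[i - 1][j - 1] + (p[i - 1] != g[j - 1]))  # substitution

    return (1.0 - D[len(p)][len(g)] / max(len(p), len(g), 1)) * 100.0
```

Because it operates on segments rather than frames, this score penalizes over-segmentation errors that frame-wise accuracy alone can hide, which motivates the paper's additional metric normalizing large versus small segments.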