{"title":"STANet: A Surgical Gesture Recognition Method Based on Spatiotemporal Fusion","authors":"Boqiang Jia, Wenjie Wang, Xin Tian, Xiaohua Wang","doi":"10.1111/nyas.70053","DOIUrl":null,"url":null,"abstract":"In robotic surgery, surgical gesture recognition has great importance in surgical quality evaluation and intelligent recognition assistance. Currently, deep learning models, such as recurrent neural networks and temporal convolutional networks, are mainly used to model action sequences and capture the temporal dependencies between them. However, some of these methods ignore the fusion of spatial and temporal features, and hence cannot effectively capture long‐term relationships and efficiently model action sequences. To overcome these limitations, we propose a spatiotemporal adaptive network (STANet) to fuse spatiotemporal features. Specifically, we designed a temporal module and a spatial module to extract respective features. Subsequently, these features were fused and further refined through temporal modeling using a temporal adaptive convolution strategy. This approach integrates both long‐term and short‐term characteristics of surgical gesture sequences. The organic combination of temporal and spatial modules was inserted into the backbone network to form the STANet, which efficiently modeled the action sequences. Our approach has been validated on the publicly available surgical gesture datasets JIGSAWS and RARP‐45, achieving very good results. Compared to other reported benchmark models, our model demonstrates exceptional performance. It can be used in surgical robots, visual feedback systems, and computer‐assisted surgery.","PeriodicalId":8250,"journal":{"name":"Annals of the New York Academy of Sciences","volume":"1 1","pages":""},"PeriodicalIF":4.8000,"publicationDate":"2025-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Annals of the New York Academy of Sciences","FirstCategoryId":"103","ListUrlMain":"https://doi.org/10.1111/nyas.70053","RegionNum":3,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
Citations: 0
Abstract
In robotic surgery, surgical gesture recognition is of great importance for surgical quality evaluation and intelligent surgical assistance. Currently, deep learning models such as recurrent neural networks and temporal convolutional networks are mainly used to model action sequences and capture the temporal dependencies between them. However, some of these methods ignore the fusion of spatial and temporal features and therefore cannot effectively capture long-term relationships or efficiently model action sequences. To overcome these limitations, we propose a spatiotemporal adaptive network (STANet) that fuses spatiotemporal features. Specifically, we design a temporal module and a spatial module to extract the respective features. These features are then fused and further refined through temporal modeling with a temporal adaptive convolution strategy, which integrates both the long-term and short-term characteristics of surgical gesture sequences. The combined temporal and spatial modules are inserted into the backbone network to form STANet, which models action sequences efficiently. Our approach is validated on the publicly available surgical gesture datasets JIGSAWS and RARP-45, where it achieves strong results and outperforms previously reported benchmark models. It can be applied in surgical robots, visual feedback systems, and computer-assisted surgery.
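To make the fusion idea concrete, below is a minimal PyTorch sketch of a spatiotemporal block in the spirit of the abstract: a spatial module and a temporal module whose outputs are fused and then refined by a temporal convolution whose kernel adapts to the input sequence. All module names (SpatioTemporalBlock, AdaptiveTemporalConv), layer choices, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of spatiotemporal fusion with an adaptive temporal
# convolution; illustrative only, not the published STANet architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveTemporalConv(nn.Module):
    """Depthwise temporal conv whose kernel is predicted from the input
    (one kernel per channel per clip), so temporal filtering adapts to
    each gesture sequence."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.kernel_size = kernel_size
        # Predict a temporal kernel per channel from globally pooled features.
        self.kernel_net = nn.Linear(channels, channels * kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) -- per-frame feature channels over time
        b, c, t = x.shape
        ctx = x.mean(dim=-1)                        # (B, C) temporal context
        k = self.kernel_net(ctx)                    # (B, C * K)
        k = k.view(b * c, 1, self.kernel_size)
        k = F.softmax(k, dim=-1)                    # normalized adaptive kernel
        # Grouped conv applies each clip's own kernels depthwise.
        x = x.reshape(1, b * c, t)
        out = F.conv1d(x, k, groups=b * c, padding=self.kernel_size // 2)
        return out.view(b, c, t)


class SpatioTemporalBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Spatial module: per-frame (pointwise) mixing of feature channels.
        self.spatial = nn.Conv1d(channels, channels, kernel_size=1)
        # Temporal module: dilated conv for longer-range dependencies.
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3,
                                  padding=2, dilation=2)
        self.adaptive = AdaptiveTemporalConv(channels)
        self.norm = nn.BatchNorm1d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fused = self.spatial(x) + self.temporal(x)  # fuse spatial + temporal
        out = self.adaptive(F.relu(self.norm(fused)))
        return out + x                              # residual connection


if __name__ == "__main__":
    feats = torch.randn(2, 64, 128)   # (batch, channels, frames)
    block = SpatioTemporalBlock(64)
    print(block(feats).shape)         # torch.Size([2, 64, 128])
```

Under these assumptions, a stack of such blocks could be inserted into a backbone so that each stage jointly refines spatial and temporal structure, with the adaptive kernel letting the same block respond to both short, rapid gestures and longer sustained ones.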
Journal Introduction:
Published on behalf of the New York Academy of Sciences, Annals of the New York Academy of Sciences provides multidisciplinary perspectives on research of current scientific interest with far-reaching implications for the wider scientific community and society at large. Each special issue assembles the best thinking of key contributors to a field of investigation at a time when emerging developments offer the promise of new insight. Individually themed, Annals special issues stimulate new ways to think about science by providing a neutral forum for discourse—within and across many institutions and fields.