Capturing action triplet correlations for accurate surgical activity recognition

Impact Factor: 4.9 · CAS Tier 2 (Medicine) · JCR Q1 (Engineering, Biomedical)
Xiaoyang Zou, Derong Yu, Guoyan Zheng
{"title":"Capturing action triplet correlations for accurate surgical activity recognition","authors":"Xiaoyang Zou ,&nbsp;Derong Yu ,&nbsp;Guoyan Zheng","doi":"10.1016/j.compmedimag.2025.102604","DOIUrl":null,"url":null,"abstract":"<div><div>Surgical activity recognition is essential for providing real-time, context-aware decision support in the development of computer-assisted surgery systems. To represent a fine-grained surgical activity, an action triplet, defined in the form of <span><math><mo>&lt;</mo></math></span>instrument, verb, target<span><math><mo>&gt;</mo></math></span>, is used. It provides information about three essential components of a surgical action, i.e., the instrument used to perform the action, the verb used to describe the action being performed, and the target tissue with which the instrument is interacting. A key challenge in surgical activity recognition lies in capturing the inherent correlations between action triplets and the associated components. In this paper, to address the challenge, starting with features extracted by a transformers-based spatial–temporal feature extractor with banded causal masks, we propose a novel framework for accurate surgical activity recognition by capturing action triplet correlations at both feature and output levels. At the feature level, we propose a graph convolutional networks (GCNs)-based module, referred as TripletGCN, to capture triplet correlations for feature enhancement. Inspired by the observation that surgeons perform specific operations using corresponding sets of instruments following clinical guidelines, a data-driven triplet correlation matrix is designed to guide information propagation among inter-dependent event nodes in TripletGCN. At the output level, in addition to applying binary cross-entropy loss for supervised learning, we propose an adversarial learning process, denoted as TripletAL, to align the joint triplet distribution between the ground truth labels and the predicted results, thereby further enhancing triplet correlations. To validate the efficacy of the proposed approach, we conducted comprehensive experiments on two publicly available datasets from the CholecTriplet2021 challenge, i.e., the CholecT45 dataset and the CholecT50 dataset. Our method achieves an average mean Average Precision (mAP) of 41.5% on the CholecT45 dataset using 5-fold cross-validation and an average mAP of 42.5% on the CholecT50 dataset using the challenge data split. Besides, we demonstrate the generalization capability of the proposed method for verb-target pair recognition on the publicly available SARAS-MESAD dataset.</div></div>","PeriodicalId":50631,"journal":{"name":"Computerized Medical Imaging and Graphics","volume":"124 ","pages":"Article 102604"},"PeriodicalIF":4.9000,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computerized Medical Imaging and Graphics","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0895611125001132","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, BIOMEDICAL","Score":null,"Total":0}
Citations: 0

Abstract

Surgical activity recognition is essential for providing real-time, context-aware decision support in the development of computer-assisted surgery systems. To represent a fine-grained surgical activity, an action triplet, defined in the form of <instrument, verb, target>, is used. It provides information about three essential components of a surgical action, i.e., the instrument used to perform the action, the verb describing the action being performed, and the target tissue with which the instrument is interacting. A key challenge in surgical activity recognition lies in capturing the inherent correlations between action triplets and their associated components. In this paper, to address this challenge, starting from features extracted by a transformer-based spatial–temporal feature extractor with banded causal masks, we propose a novel framework for accurate surgical activity recognition that captures action triplet correlations at both the feature and output levels. At the feature level, we propose a graph convolutional network (GCN)-based module, referred to as TripletGCN, to capture triplet correlations for feature enhancement. Inspired by the observation that surgeons perform specific operations using corresponding sets of instruments following clinical guidelines, a data-driven triplet correlation matrix is designed to guide information propagation among inter-dependent event nodes in TripletGCN. At the output level, in addition to applying a binary cross-entropy loss for supervised learning, we propose an adversarial learning process, denoted TripletAL, to align the joint triplet distribution between the ground-truth labels and the predicted results, thereby further enhancing triplet correlations. To validate the efficacy of the proposed approach, we conducted comprehensive experiments on two publicly available datasets from the CholecTriplet2021 challenge, i.e., the CholecT45 dataset and the CholecT50 dataset. Our method achieves an average mean Average Precision (mAP) of 41.5% on the CholecT45 dataset using 5-fold cross-validation and an average mAP of 42.5% on the CholecT50 dataset using the challenge data split. In addition, we demonstrate the generalization capability of the proposed method for verb-target pair recognition on the publicly available SARAS-MESAD dataset.
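As context for the spatial–temporal feature extractor, the following is a minimal sketch (in PyTorch, not the authors' code) of a banded causal attention mask as described above: each frame may attend only to itself and a fixed number of preceding frames, which preserves causality for online recognition. The function and parameter names (`banded_causal_mask`, `band_width`) are illustrative assumptions.

```python
# Minimal sketch of a banded causal attention mask (illustrative, not the
# authors' implementation). Frame t may attend to frames in
# [t - band_width + 1, t]: causal, with a limited temporal band.
import torch

def banded_causal_mask(seq_len: int, band_width: int) -> torch.Tensor:
    """Boolean mask of shape (seq_len, seq_len); True = attention allowed."""
    idx = torch.arange(seq_len)
    diff = idx.unsqueeze(1) - idx.unsqueeze(0)  # diff[i, j] = i - j
    # allow key j for query i when j <= i (causal) and i - j < band_width
    return (diff >= 0) & (diff < band_width)

# Example: an 8-frame clip where each frame sees at most 3 frames (itself
# plus the two previous ones); usable as `attn_mask` in
# torch.nn.functional.scaled_dot_product_attention, where True means
# "may attend".
mask = banded_causal_mask(seq_len=8, band_width=3)
```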
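To make the TripletGCN idea concrete, here is a minimal sketch of GCN-style message passing over event nodes (instruments, verbs, targets, triplets) guided by a correlation matrix built from label co-occurrence statistics, assuming the common symmetric normalization; the paper's actual data-driven matrix construction and layer design may differ.

```python
# Sketch of correlation-guided graph convolution (assumptions: symmetric
# normalization D^{-1/2} A D^{-1/2} and a co-occurrence-based matrix; the
# paper's data-driven construction may differ).
import torch
import torch.nn as nn

class CorrelationGCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, corr: torch.Tensor):
        super().__init__()
        deg = corr.sum(dim=1).clamp(min=1e-6)
        d_inv_sqrt = deg.pow(-0.5)
        # normalized adjacency, fixed during training
        self.register_buffer("adj", d_inv_sqrt[:, None] * corr * d_inv_sqrt[None, :])
        self.linear = nn.Linear(in_dim, out_dim)
        self.act = nn.ReLU()

    def forward(self, node_feats: torch.Tensor) -> torch.Tensor:
        # node_feats: (num_nodes, in_dim); features propagate between
        # correlated event nodes before classification
        return self.act(self.adj @ self.linear(node_feats))

# Hypothetical correlation matrix from label co-occurrence counts:
num_nodes, dim = 10, 64
counts = torch.rand(num_nodes, num_nodes)        # stand-in for dataset statistics
corr = counts / counts.sum(dim=1, keepdim=True)  # row-normalized co-occurrence
corr = corr + torch.eye(num_nodes)               # self-loops
layer = CorrelationGCNLayer(dim, dim, corr)
enhanced = layer(torch.randn(num_nodes, dim))
```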
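Similarly, the output-level adversarial alignment can be pictured as a standard GAN-style objective: a small discriminator tries to distinguish ground-truth multi-hot triplet vectors from predicted probability vectors, and the recognition model earns an extra loss term for fooling it, pushing the predicted joint triplet distribution toward that of the labels. The sketch below is one illustrative reading of TripletAL, not the authors' implementation; all module and loss names, and the discriminator architecture, are assumptions.

```python
# Illustrative sketch of output-level adversarial alignment (GAN-style
# reading of TripletAL; module and loss names are our own).
import torch
import torch.nn as nn

num_triplets = 100  # e.g., the CholecT45/CholecT50 triplet vocabulary size

# discriminator: scores a triplet vector as "real label" vs. "prediction"
discriminator = nn.Sequential(
    nn.Linear(num_triplets, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
)
bce_logits = nn.BCEWithLogitsLoss()

def discriminator_loss(real_labels, pred_probs):
    # ground-truth multi-hot labels scored as real, detached predictions as fake
    real_score = discriminator(real_labels.float())
    fake_score = discriminator(pred_probs.detach())
    return (bce_logits(real_score, torch.ones_like(real_score)) +
            bce_logits(fake_score, torch.zeros_like(fake_score)))

def alignment_loss(pred_probs):
    # added to the supervised BCE term: the recognizer is rewarded when the
    # discriminator scores its predicted triplet vector as "real"
    fake_score = discriminator(pred_probs)
    return bce_logits(fake_score, torch.ones_like(fake_score))
```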
Source journal: Computerized Medical Imaging and Graphics
CiteScore: 10.70 · Self-citation rate: 3.50% · Annual publications: 71 · Review time: 26 days
Journal description

The purpose of the journal Computerized Medical Imaging and Graphics is to act as a source for the exchange of research results concerning algorithmic advances, development, and application of digital imaging in disease detection, diagnosis, intervention, prevention, precision medicine, and population health. Included in the journal will be articles on novel computerized imaging or visualization techniques, including artificial intelligence and machine learning, augmented reality for surgical planning and guidance, big biomedical data visualization, computer-aided diagnosis, computerized-robotic surgery, image-guided therapy, imaging scanning and reconstruction, mobile and tele-imaging, radiomics, and imaging integration and modeling with other information relevant to digital health. The types of biomedical imaging include: magnetic resonance, computed tomography, ultrasound, nuclear medicine, X-ray, microwave, optical and multi-photon microscopy, video and sensory imaging, and the convergence of biomedical images with other non-imaging datasets.