{"title":"捕捉动作三重关联以准确识别手术活动","authors":"Xiaoyang Zou , Derong Yu , Guoyan Zheng","doi":"10.1016/j.compmedimag.2025.102604","DOIUrl":null,"url":null,"abstract":"<div><div>Surgical activity recognition is essential for providing real-time, context-aware decision support in the development of computer-assisted surgery systems. To represent a fine-grained surgical activity, an action triplet, defined in the form of <span><math><mo><</mo></math></span>instrument, verb, target<span><math><mo>></mo></math></span>, is used. It provides information about three essential components of a surgical action, i.e., the instrument used to perform the action, the verb used to describe the action being performed, and the target tissue with which the instrument is interacting. A key challenge in surgical activity recognition lies in capturing the inherent correlations between action triplets and the associated components. In this paper, to address the challenge, starting with features extracted by a transformers-based spatial–temporal feature extractor with banded causal masks, we propose a novel framework for accurate surgical activity recognition by capturing action triplet correlations at both feature and output levels. At the feature level, we propose a graph convolutional networks (GCNs)-based module, referred as TripletGCN, to capture triplet correlations for feature enhancement. Inspired by the observation that surgeons perform specific operations using corresponding sets of instruments following clinical guidelines, a data-driven triplet correlation matrix is designed to guide information propagation among inter-dependent event nodes in TripletGCN. At the output level, in addition to applying binary cross-entropy loss for supervised learning, we propose an adversarial learning process, denoted as TripletAL, to align the joint triplet distribution between the ground truth labels and the predicted results, thereby further enhancing triplet correlations. To validate the efficacy of the proposed approach, we conducted comprehensive experiments on two publicly available datasets from the CholecTriplet2021 challenge, i.e., the CholecT45 dataset and the CholecT50 dataset. Our method achieves an average mean Average Precision (mAP) of 41.5% on the CholecT45 dataset using 5-fold cross-validation and an average mAP of 42.5% on the CholecT50 dataset using the challenge data split. Besides, we demonstrate the generalization capability of the proposed method for verb-target pair recognition on the publicly available SARAS-MESAD dataset.</div></div>","PeriodicalId":50631,"journal":{"name":"Computerized Medical Imaging and Graphics","volume":"124 ","pages":"Article 102604"},"PeriodicalIF":4.9000,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Capturing action triplet correlations for accurate surgical activity recognition\",\"authors\":\"Xiaoyang Zou , Derong Yu , Guoyan Zheng\",\"doi\":\"10.1016/j.compmedimag.2025.102604\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Surgical activity recognition is essential for providing real-time, context-aware decision support in the development of computer-assisted surgery systems. To represent a fine-grained surgical activity, an action triplet, defined in the form of <span><math><mo><</mo></math></span>instrument, verb, target<span><math><mo>></mo></math></span>, is used. 
It provides information about three essential components of a surgical action, i.e., the instrument used to perform the action, the verb used to describe the action being performed, and the target tissue with which the instrument is interacting. A key challenge in surgical activity recognition lies in capturing the inherent correlations between action triplets and the associated components. In this paper, to address the challenge, starting with features extracted by a transformers-based spatial–temporal feature extractor with banded causal masks, we propose a novel framework for accurate surgical activity recognition by capturing action triplet correlations at both feature and output levels. At the feature level, we propose a graph convolutional networks (GCNs)-based module, referred as TripletGCN, to capture triplet correlations for feature enhancement. Inspired by the observation that surgeons perform specific operations using corresponding sets of instruments following clinical guidelines, a data-driven triplet correlation matrix is designed to guide information propagation among inter-dependent event nodes in TripletGCN. At the output level, in addition to applying binary cross-entropy loss for supervised learning, we propose an adversarial learning process, denoted as TripletAL, to align the joint triplet distribution between the ground truth labels and the predicted results, thereby further enhancing triplet correlations. To validate the efficacy of the proposed approach, we conducted comprehensive experiments on two publicly available datasets from the CholecTriplet2021 challenge, i.e., the CholecT45 dataset and the CholecT50 dataset. Our method achieves an average mean Average Precision (mAP) of 41.5% on the CholecT45 dataset using 5-fold cross-validation and an average mAP of 42.5% on the CholecT50 dataset using the challenge data split. Besides, we demonstrate the generalization capability of the proposed method for verb-target pair recognition on the publicly available SARAS-MESAD dataset.</div></div>\",\"PeriodicalId\":50631,\"journal\":{\"name\":\"Computerized Medical Imaging and Graphics\",\"volume\":\"124 \",\"pages\":\"Article 102604\"},\"PeriodicalIF\":4.9000,\"publicationDate\":\"2025-07-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computerized Medical Imaging and Graphics\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0895611125001132\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, BIOMEDICAL\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computerized Medical Imaging and Graphics","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0895611125001132","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, BIOMEDICAL","Score":null,"Total":0}
Capturing action triplet correlations for accurate surgical activity recognition
Surgical activity recognition is essential for providing real-time, context-aware decision support in the development of computer-assisted surgery systems. A fine-grained surgical activity is represented by an action triplet of the form ⟨instrument, verb, target⟩, which captures the three essential components of a surgical action: the instrument used to perform the action, the verb describing the action being performed, and the target tissue with which the instrument is interacting. A key challenge in surgical activity recognition lies in capturing the inherent correlations between action triplets and their associated components. To address this challenge, we propose a novel framework for accurate surgical activity recognition that captures action triplet correlations at both the feature and output levels, starting from features extracted by a transformer-based spatial-temporal feature extractor with banded causal masks. At the feature level, we propose a module based on graph convolutional networks (GCNs), referred to as TripletGCN, to capture triplet correlations for feature enhancement. Inspired by the observation that surgeons perform specific operations using corresponding sets of instruments following clinical guidelines, a data-driven triplet correlation matrix is designed to guide information propagation among inter-dependent event nodes in TripletGCN. At the output level, in addition to applying a binary cross-entropy loss for supervised learning, we propose an adversarial learning process, denoted TripletAL, to align the joint triplet distribution of the predicted results with that of the ground-truth labels, thereby further enhancing triplet correlations. To validate the efficacy of the proposed approach, we conducted comprehensive experiments on two publicly available datasets from the CholecTriplet2021 challenge, the CholecT45 and CholecT50 datasets. Our method achieves a mean Average Precision (mAP) of 41.5% on CholecT45, averaged over 5-fold cross-validation, and an average mAP of 42.5% on CholecT50 using the challenge data split. In addition, we demonstrate the generalization capability of the proposed method for verb-target pair recognition on the publicly available SARAS-MESAD dataset.
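Two mechanisms named in the abstract lend themselves to a short illustration: the banded causal mask, which lets each video frame attend only to a fixed window of preceding frames, and a GCN layer whose information propagation is steered by a data-driven correlation matrix. Below is a minimal PyTorch sketch of both ideas; it is not the authors' implementation, and the names (banded_causal_mask, TripletGCNSketch), the symmetric normalization, and all shapes and hyper-parameters are assumptions of this sketch.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def banded_causal_mask(seq_len: int, band: int) -> torch.Tensor:
        # True where attention is allowed: frame t may attend only to the
        # 'band' most recent frames, up to and including itself.
        i = torch.arange(seq_len).unsqueeze(1)  # query (current frame) index
        j = torch.arange(seq_len).unsqueeze(0)  # key (attended frame) index
        return (j <= i) & (j > i - band)

    class TripletGCNSketch(nn.Module):
        # One graph-convolution layer whose propagation among
        # instrument/verb/target/triplet nodes is guided by a fixed
        # correlation matrix (data-driven in the paper).
        def __init__(self, dim: int, corr: torch.Tensor):
            super().__init__()
            deg = corr.sum(dim=1).clamp(min=1e-6)
            d = deg.pow(-0.5)
            # Symmetrically normalized adjacency: D^{-1/2} A D^{-1/2}
            self.register_buffer("a_hat", d[:, None] * corr * d[None, :])
            self.proj = nn.Linear(dim, dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, num_nodes, dim) node features; propagate, then transform.
            return F.relu(self.proj(self.a_hat @ x))

    # Usage on toy shapes:
    mask = banded_causal_mask(seq_len=8, band=3)  # (8, 8) boolean mask
    corr = torch.rand(5, 5)                       # placeholder correlation matrix
    gcn = TripletGCNSketch(dim=16, corr=corr)
    out = gcn(torch.randn(2, 5, 16))              # -> (2, 5, 16)

The TripletAL adversarial objective is deliberately not sketched, since the abstract does not specify the discriminator or the distance used to align the joint triplet distributions.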
Journal introduction:
The purpose of the journal Computerized Medical Imaging and Graphics is to act as a source for the exchange of research results concerning algorithmic advances, development, and application of digital imaging in disease detection, diagnosis, intervention, prevention, precision medicine, and population health. Included in the journal will be articles on novel computerized imaging or visualization techniques, including artificial intelligence and machine learning, augmented reality for surgical planning and guidance, big biomedical data visualization, computer-aided diagnosis, computerized-robotic surgery, image-guided therapy, imaging scanning and reconstruction, mobile and tele-imaging, radiomics, and imaging integration and modeling with other information relevant to digital health. The types of biomedical imaging include: magnetic resonance, computed tomography, ultrasound, nuclear medicine, X-ray, microwave, optical and multi-photon microscopy, video and sensory imaging, and the convergence of biomedical images with other non-imaging datasets.