Multimodal graph representation learning for robust surgical workflow recognition with adversarial feature disentanglement

Impact Factor: 14.7 · CAS Tier 1, Computer Science · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Long Bai, Boyi Ma, Ruohan Wang, Guankun Wang, Beilei Cui, Zhongliang Jiang, Mobarakol Islam, Zhe Min, Jiewen Lai, Nassir Navab, Hongliang Ren
{"title":"Multimodal graph representation learning for robust surgical workflow recognition with adversarial feature disentanglement","authors":"Long Bai ,&nbsp;Boyi Ma ,&nbsp;Ruohan Wang ,&nbsp;Guankun Wang ,&nbsp;Beilei Cui ,&nbsp;Zhongliang Jiang ,&nbsp;Mobarakol Islam ,&nbsp;Zhe Min ,&nbsp;Jiewen Lai ,&nbsp;Nassir Navab ,&nbsp;Hongliang Ren","doi":"10.1016/j.inffus.2025.103290","DOIUrl":null,"url":null,"abstract":"<div><div>Surgical workflow recognition is vital for automating tasks, supporting decision-making, and training novice surgeons, ultimately improving patient safety and standardizing procedures. However, data corruption can lead to performance degradation due to issues like occlusion from bleeding or smoke in surgical scenes and problems with data storage and transmission. Therefore, a robust workflow recognition model is urgently needed. In this case, we explore a robust graph-based multimodal approach to integrating vision and kinematic data to enhance accuracy and reliability. Vision data captures dynamic surgical scenes, while kinematic data provides precise movement information, overcoming limitations of visual recognition under adverse conditions. We propose a multimodal Graph Representation network with Adversarial feature Disentanglement (GRAD) for robust surgical workflow recognition in challenging scenarios with domain shifts or corrupted data. Specifically, we introduce a Multimodal Disentanglement Graph Network (MDGNet) that captures fine-grained visual information while explicitly modeling the complex relationships between vision and kinematic embeddings through graph-based message modeling. To align feature spaces across modalities, we propose a Vision-Kinematic Adversarial (VKA) framework that leverages adversarial training to reduce modality gaps and improve feature consistency. Furthermore, we design a Contextual Calibrated Decoder, incorporating temporal and contextual priors to enhance robustness against domain shifts and corrupted data. Extensive comparative and ablation experiments demonstrate the effectiveness of our model and proposed modules. Specifically, we achieved an accuracy of 86.87% and 92.38% on two public datasets, respectively. Moreover, our robustness experiments show that our method effectively handles data corruption during storage and transmission, exhibiting excellent stability and robustness. Our approach aims to advance automated surgical workflow recognition, addressing the complexities and dynamism inherent in surgical procedures.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"123 ","pages":"Article 103290"},"PeriodicalIF":14.7000,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S156625352500363X","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Surgical workflow recognition is vital for automating tasks, supporting decision-making, and training novice surgeons, ultimately improving patient safety and standardizing procedures. However, performance can degrade when data are corrupted, whether by occlusion from bleeding or smoke in the surgical scene or by faults in data storage and transmission, so a robust workflow recognition model is urgently needed. To this end, we explore a robust graph-based multimodal approach that integrates vision and kinematic data to enhance accuracy and reliability. Vision data capture dynamic surgical scenes, while kinematic data provide precise movement information, overcoming the limitations of visual recognition under adverse conditions. We propose a multimodal Graph Representation network with Adversarial feature Disentanglement (GRAD) for robust surgical workflow recognition in challenging scenarios with domain shifts or corrupted data. Specifically, we introduce a Multimodal Disentanglement Graph Network (MDGNet) that captures fine-grained visual information while explicitly modeling the complex relationships between vision and kinematic embeddings through graph-based message passing. To align feature spaces across modalities, we propose a Vision-Kinematic Adversarial (VKA) framework that leverages adversarial training to reduce modality gaps and improve feature consistency. Furthermore, we design a Contextual Calibrated Decoder that incorporates temporal and contextual priors to enhance robustness against domain shifts and corrupted data. Extensive comparative and ablation experiments demonstrate the effectiveness of our model and its modules: we achieve accuracies of 86.87% and 92.38% on two public datasets, respectively. Moreover, robustness experiments show that our method effectively handles data corruption during storage and transmission, exhibiting excellent stability. Our approach advances automated surgical workflow recognition, addressing the complexity and dynamism inherent in surgical procedures.
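The two mechanisms the abstract names, graph-based message passing between vision and kinematic embeddings (MDGNet) and adversarial alignment of the two feature spaces (VKA), can be illustrated with a short sketch. What follows is a minimal, hypothetical PyTorch example, not the paper's implementation: the two-node graph topology, all module names and dimensions, and the use of a gradient-reversal layer for the adversarial part are assumptions made purely for illustration.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity on the forward pass; scaled, sign-flipped gradient on the
    # backward pass. A common way to train a feature extractor against a
    # modality discriminator (assumed here, not confirmed by the paper).
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class CrossModalGraphLayer(nn.Module):
    # One round of message passing on a toy two-node graph:
    # a vision node and a kinematic node exchanging messages.
    def __init__(self, dim):
        super().__init__()
        self.msg_v2k = nn.Linear(dim, dim)  # vision -> kinematic message
        self.msg_k2v = nn.Linear(dim, dim)  # kinematic -> vision message
        self.update = nn.GRUCell(dim, dim)  # node update from incoming message

    def forward(self, h_vis, h_kin):
        m_to_kin = torch.relu(self.msg_v2k(h_vis))
        m_to_vis = torch.relu(self.msg_k2v(h_kin))
        return self.update(m_to_vis, h_vis), self.update(m_to_kin, h_kin)

class ToyGraphFusion(nn.Module):
    def __init__(self, dim=256, num_phases=7, lam=0.1):
        super().__init__()
        self.gnn = CrossModalGraphLayer(dim)
        self.phase_head = nn.Linear(2 * dim, num_phases)
        # The discriminator tries to tell which modality a feature came from;
        # the reversed gradient pushes both branches toward a shared space.
        self.disc = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 2))
        self.lam = lam

    def forward(self, h_vis, h_kin):
        h_vis, h_kin = self.gnn(h_vis, h_kin)
        phase_logits = self.phase_head(torch.cat([h_vis, h_kin], dim=-1))
        feats = torch.cat([h_vis, h_kin], dim=0)
        mod_logits = self.disc(GradReverse.apply(feats, self.lam))
        mod_labels = torch.cat([torch.zeros(len(h_vis)),
                                torch.ones(len(h_kin))]).long()
        return phase_logits, mod_logits, mod_labels

# One training step on random stand-in features (batch of 4, 7 phases).
model = ToyGraphFusion()
h_vis, h_kin = torch.randn(4, 256), torch.randn(4, 256)
phase_logits, mod_logits, mod_labels = model(h_vis, h_kin)
loss = nn.functional.cross_entropy(phase_logits, torch.randint(0, 7, (4,))) \
     + nn.functional.cross_entropy(mod_logits, mod_labels)
loss.backward()

Under this reading, the discriminator learns to separate vision features from kinematic features while the reversed gradient trains the rest of the network to make them indistinguishable, which is one standard way to realize the kind of adversarial feature alignment the VKA framework describes; how GRAD actually disentangles and calibrates features is specified only in the full paper.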
Source Journal
Information Fusion (Engineering & Technology - Computer Science: Theory & Methods)
CiteScore: 33.20
Self-citation rate: 4.30%
Articles per year: 161
Review time: 7.9 months
About the journal: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.