SSIFNet: Spatial–temporal stereo information fusion network for self-supervised surgical video inpainting

IF 4.9 · CAS Tier 2 (Medicine) · Q1 ENGINEERING, BIOMEDICAL
Xiaoyang Zou, Zhuyuan Zhang, Derong Yu, Wenyuan Sun, Wenyong Liu, Donghua Hang, Wei Bao, Guoyan Zheng
{"title":"SSIFNet: Spatial–temporal stereo information fusion network for self-supervised surgical video inpainting","authors":"Xiaoyang Zou ,&nbsp;Zhuyuan Zhang ,&nbsp;Derong Yu ,&nbsp;Wenyuan Sun ,&nbsp;Wenyong Liu ,&nbsp;Donghua Hang ,&nbsp;Wei Bao ,&nbsp;Guoyan Zheng","doi":"10.1016/j.compmedimag.2025.102622","DOIUrl":null,"url":null,"abstract":"<div><div>During minimally invasive robot-assisted surgical procedures, surgeons rely on stereo endoscopes to provide image guidance. Nevertheless, the field-of-view is typically restricted owing to the limited size of the endoscope and constrained workspace. Such a visualization challenge becomes even more severe when surgical instruments are inserted into the already restricted field-of-view, where important anatomical landmarks and relevant clinical contents may become occluded by the inserted instruments. To address the challenge, in this work, we propose a novel end-to-end trainable spatial–temporal stereo information fusion network, referred as SSIFNet, for inpainting surgical videos of surgical scene under instrument occlusions in robot-assisted endoscopic surgery. The proposed SSIFNet features three essential modules including a novel optical flow-guided deformable feature propagation (OFDFP) module, a novel spatial–temporal stereo focal transformer (S<span><math><msup><mrow></mrow><mrow><mn>2</mn></mrow></msup></math></span>FT)-based information fusion module, and a novel stereo-consistency enforcement (SE) module. These three modules work synergistically to inpaint occluded regions in the surgical scene. More importantly, SSIFNet is trained in a self-supervised manner with simulated occlusions by a novel loss function, which is designed to combine flow completion, disparity matching, cross-warping consistency, warping-consistency, image and adversarial loss terms to generate high fidelity and accurate occlusion reconstructions in both views. After training, the trained model can be applied directly to inpainting surgical videos with true instrument occlusions to generate results with not only spatial and temporal consistency but also stereo-consistency. Comprehensive quantitative and qualitative experimental results demonstrate that SSIFNet outperforms state-of-the-art (SOTA) video inpainting methods. The source code of this study will be released at <span><span>https://github.com/SHAUNZXY/SSIFNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50631,"journal":{"name":"Computerized Medical Imaging and Graphics","volume":"125 ","pages":"Article 102622"},"PeriodicalIF":4.9000,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computerized Medical Imaging and Graphics","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0895611125001314","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, BIOMEDICAL","Score":null,"Total":0}
Citations: 0

Abstract

During minimally invasive robot-assisted surgical procedures, surgeons rely on stereo endoscopes to provide image guidance. Nevertheless, the field-of-view is typically restricted owing to the limited size of the endoscope and the constrained workspace. This visualization challenge becomes even more severe when surgical instruments are inserted into the already restricted field-of-view, where important anatomical landmarks and relevant clinical content may become occluded by the inserted instruments. To address this challenge, we propose a novel end-to-end trainable spatial–temporal stereo information fusion network, referred to as SSIFNet, for inpainting the surgical scene under instrument occlusions in videos of robot-assisted endoscopic surgery. The proposed SSIFNet features three essential modules: a novel optical flow-guided deformable feature propagation (OFDFP) module, a novel spatial–temporal stereo focal transformer (S²FT)-based information fusion module, and a novel stereo-consistency enforcement (SE) module. These three modules work synergistically to inpaint occluded regions in the surgical scene. More importantly, SSIFNet is trained in a self-supervised manner on simulated occlusions with a novel loss function that combines flow-completion, disparity-matching, cross-warping-consistency, warping-consistency, image, and adversarial loss terms to generate high-fidelity, accurate occlusion reconstructions in both views. After training, the model can be applied directly to inpaint surgical videos with real instrument occlusions, generating results that exhibit not only spatial and temporal consistency but also stereo consistency. Comprehensive quantitative and qualitative experimental results demonstrate that SSIFNet outperforms state-of-the-art (SOTA) video inpainting methods. The source code of this study will be released at https://github.com/SHAUNZXY/SSIFNet.
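The abstract names six loss terms that are combined into a single self-supervised training objective. As a rough illustration only, the following minimal PyTorch-style sketch shows how such a weighted composite loss is commonly assembled; the weights in `lambdas`, the function name, and the assumption that the image term is an L1 reconstruction loss on both views are all hypothetical placeholders, not the paper's released implementation (the actual code is promised at the GitHub link above).

```python
# Illustrative sketch only: every weight and helper here is a hypothetical
# placeholder, not SSIFNet's released implementation.
import torch
import torch.nn.functional as F

def combined_inpainting_loss(
    pred_l: torch.Tensor, pred_r: torch.Tensor,   # inpainted left/right frames
    gt_l: torch.Tensor, gt_r: torch.Tensor,       # ground truth, known under simulated occlusion
    flow_loss: torch.Tensor,        # flow-completion term
    disp_loss: torch.Tensor,        # disparity-matching term
    cross_warp_loss: torch.Tensor,  # left<->right cross-warping-consistency term
    warp_loss: torch.Tensor,        # temporal warping-consistency term
    adv_loss: torch.Tensor,         # adversarial (discriminator) term
    lambdas=(1.0, 1.0, 0.5, 0.5, 1.0, 0.01),      # hypothetical weights
) -> torch.Tensor:
    """Weighted sum of the six loss terms named in the abstract."""
    # Image reconstruction term, assumed here to be an L1 loss on both views.
    img_loss = F.l1_loss(pred_l, gt_l) + F.l1_loss(pred_r, gt_r)
    terms = (flow_loss, disp_loss, cross_warp_loss, warp_loss, img_loss, adv_loss)
    return sum(w * t for w, t in zip(lambdas, terms))
```

In the paper's pipeline, each term would presumably be produced by its own sub-module (e.g., the OFDFP module for flow completion, the SE module for stereo consistency) and a discriminator for the adversarial term; this sketch only shows the weighted aggregation step named in the abstract.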
Source Journal

CiteScore: 10.70
Self-citation rate: 3.50%
Articles per year: 71
Review time: 26 days
Journal Introduction

The purpose of the journal Computerized Medical Imaging and Graphics is to act as a source for the exchange of research results concerning algorithmic advances, development, and application of digital imaging in disease detection, diagnosis, intervention, prevention, precision medicine, and population health. The journal includes articles on novel computerized imaging or visualization techniques, including artificial intelligence and machine learning, augmented reality for surgical planning and guidance, big biomedical data visualization, computer-aided diagnosis, computerized-robotic surgery, image-guided therapy, imaging scanning and reconstruction, mobile and tele-imaging, radiomics, and imaging integration and modeling with other information relevant to digital health. The types of biomedical imaging include: magnetic resonance, computed tomography, ultrasound, nuclear medicine, X-ray, microwave, optical and multi-photon microscopy, video and sensory imaging, and the convergence of biomedical images with other non-imaging datasets.