Xiaoyang Zou , Zhuyuan Zhang , Derong Yu , Wenyuan Sun , Wenyong Liu , Donghua Hang , Wei Bao , Guoyan Zheng
DOI: 10.1016/j.compmedimag.2025.102622
Journal: Computerized Medical Imaging and Graphics, Volume 125, Article 102622; published 2025-08-25
JCR: Q1 (Engineering, Biomedical); Impact Factor: 4.9
SSIFNet: Spatial–temporal stereo information fusion network for self-supervised surgical video inpainting
During minimally invasive robot-assisted surgical procedures, surgeons rely on stereo endoscopes for image guidance. However, the field-of-view is typically restricted owing to the limited size of the endoscope and the constrained workspace. This visualization challenge becomes even more severe when surgical instruments are inserted into the already restricted field-of-view, where important anatomical landmarks and relevant clinical content may be occluded by the inserted instruments. To address this challenge, we propose a novel end-to-end trainable spatial–temporal stereo information fusion network, referred to as SSIFNet, for inpainting surgical scenes under instrument occlusion in robot-assisted endoscopic surgery. The proposed SSIFNet features three essential modules: a novel optical flow-guided deformable feature propagation (OFDFP) module, a novel spatial–temporal stereo focal transformer (S²FT)-based information fusion module, and a novel stereo-consistency enforcement (SE) module. These three modules work synergistically to inpaint occluded regions in the surgical scene. More importantly, SSIFNet is trained in a self-supervised manner on simulated occlusions with a novel loss function that combines flow-completion, disparity-matching, cross-warping-consistency, warping-consistency, image, and adversarial loss terms to generate high-fidelity, accurate occlusion reconstructions in both views. After training, the model can be applied directly to inpainting surgical videos with real instrument occlusions, producing results that are not only spatially and temporally consistent but also stereo-consistent. Comprehensive quantitative and qualitative experimental results demonstrate that SSIFNet outperforms state-of-the-art (SOTA) video inpainting methods. The source code of this study will be released at https://github.com/SHAUNZXY/SSIFNet.
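The cross-warping-consistency idea in the loss function above can be illustrated with a minimal sketch: under a rectified-stereo assumption (purely horizontal disparity), the right view is warped into the left view's frame and compared against the left view, penalizing stereo-inconsistent inpainting. The function names, nearest-neighbor warping, and L1 penalty below are illustrative assumptions, not taken from the released SSIFNet code:

```python
import numpy as np

def warp_right_to_left(right: np.ndarray, disparity: np.ndarray) -> np.ndarray:
    """Warp the right view into the left view's frame using per-pixel
    horizontal disparity (rectified stereo: left[y, x] ~ right[y, x - d])."""
    h, w = right.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbor sampling column in the right image, clamped to bounds.
    src = np.clip(np.round(xs - disparity).astype(int), 0, w - 1)
    return right[ys, src]

def cross_warping_consistency(left: np.ndarray,
                              right: np.ndarray,
                              disparity: np.ndarray) -> float:
    """Mean L1 difference between the left view and the warped right view."""
    return float(np.mean(np.abs(left - warp_right_to_left(right, disparity))))

# Toy example: a horizontal intensity ramp shifted by a constant disparity of 2.
left = np.tile(np.arange(8, dtype=float), (4, 1))
disparity = np.full((4, 8), 2.0)
right = left + 2.0  # right[y, x - 2] == left[y, x] for x >= 2
loss = cross_warping_consistency(left, right, disparity)
```

In a real training loop the disparity map would come from the disparity-matching branch, the warp would use differentiable bilinear sampling, and the boundary columns (where clamping occurs) would typically be masked out of the loss.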
Journal description:
The purpose of the journal Computerized Medical Imaging and Graphics is to act as a source for the exchange of research results concerning algorithmic advances, development, and application of digital imaging in disease detection, diagnosis, intervention, prevention, precision medicine, and population health. Included in the journal will be articles on novel computerized imaging or visualization techniques, including artificial intelligence and machine learning, augmented reality for surgical planning and guidance, big biomedical data visualization, computer-aided diagnosis, computerized-robotic surgery, image-guided therapy, imaging scanning and reconstruction, mobile and tele-imaging, radiomics, and imaging integration and modeling with other information relevant to digital health. The types of biomedical imaging include: magnetic resonance, computed tomography, ultrasound, nuclear medicine, X-ray, microwave, optical and multi-photon microscopy, video and sensory imaging, and the convergence of biomedical images with other non-imaging datasets.