Semi-supervised spatial-temporal calibration and semantic refinement network for video polyp segmentation

IF 4.9 · CAS Region 2 (Medicine) · JCR Q1, ENGINEERING, BIOMEDICAL
Feng Li , Zetao Huang , Lu Zhou , Haixia Peng , Yimin Chu
DOI: 10.1016/j.bspc.2024.107127
Journal: Biomedical Signal Processing and Control, Volume 100, Article 107127
Published: 2024-11-13 (Journal Article)
Source URL: https://www.sciencedirect.com/science/article/pii/S1746809424011856
Citations: 0

Abstract

Automated video polyp segmentation (VPS) is vital for the early prevention and diagnosis of colorectal cancer (CRC). However, existing deep-learning-based automatic polyp segmentation methods focus mainly on independent static images and struggle to perform well because they neglect the spatial-temporal relationships among successive video frames, while also requiring massive frame-by-frame annotation. To alleviate these challenges, we propose a novel semi-supervised spatial-temporal calibration and semantic refinement network (STCSR-Net) dedicated to VPS, which simultaneously considers inter-frame temporal consistency within video clips and intra-frame semantic-spatial information. It is composed of a segmentation pathway and a propagation pathway that use a co-training scheme to supervise predictions on un-annotated images in a semi-supervised learning fashion. Specifically, we propose an adaptive sequence calibration (ASC) block in the segmentation pathway and a dynamic transmission calibration (DTC) block in the propagation pathway to fully exploit valuable temporal cues and keep predictions temporally consistent across consecutive frames. Meanwhile, in these two branches we introduce a residual block (RB) to suppress irrelevant noisy information and highlight the rich local boundary details of polyp lesions, and construct a multi-scale context extraction (MCE) module to enhance multi-scale high-level semantic feature expression. On that basis, we design a progressive adaptive context fusion (PACF) module that gradually aggregates multi-level features under the guidance of reinforced high-level semantic information, eliminating the semantic gaps among them and promoting the discriminative capacity of features for targeting polyp objects. Through the synergistic combination of the RB, MCE, and PACF modules, semantic-spatial correlations of polyp lesions within each frame can be established.
Coupled with a context-free loss, our model merges the feature representations of neighboring frames to diminish the dependency on varying contexts within consecutive frames and strengthen its robustness. Extensive experiments substantiate that our model, trained with a 100% annotation ratio, achieves state-of-the-art performance on challenging datasets. Even when trained with a 50% annotation ratio, our model significantly exceeds existing state-of-the-art image-based and video-based polyp segmentation models on the newly built local TRPolyp dataset, with enhancements of at least 1.3% in mDice and 0.9% in mIoU, while exhibiting performance comparable to top rivals trained with a fully supervised approach on the publicly available CVC-612, CVC-300, and ASU-Mayo-Clinic benchmarks. Notably, our model performs exceptionally well in videos containing complex scenarios such as motion blur and occlusion. Beyond that, it also attains approximately 0.794 mDice and 0.707 mIoU at an inference speed of 0.036 s per frame in an endoscopist-machine competition, outperforming junior and senior endoscopists and almost matching experts. The strong capability of the proposed STCSR-Net holds promise for improving the quality of VPS, accentuating the model's adaptability and potential in real-world clinical scenarios.
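The two-pathway co-training scheme described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the hard-pseudo-label exchange between pathways and the plain binary cross-entropy losses are simplifications chosen for clarity, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(p, t, eps=1e-8):
    """Mean binary cross-entropy between probabilities p and targets t."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(t * np.log(p) + (1 - t) * np.log(1 - p)).mean())

def co_training_loss(seg_logits, prop_logits, mask=None):
    """Sketch of two-pathway co-training on one frame.

    If a ground-truth mask exists (labeled frame), both pathways take a
    supervised loss; otherwise each pathway is supervised by the other's
    hard pseudo-label, so the unlabeled frames still drive training.
    """
    p_seg, p_prop = sigmoid(seg_logits), sigmoid(prop_logits)
    if mask is not None:
        return bce(p_seg, mask) + bce(p_prop, mask)
    # Cross supervision: threshold each pathway's prediction into a
    # pseudo-mask and use it as the target for the opposite pathway.
    pseudo_seg = (p_seg > 0.5).astype(float)
    pseudo_prop = (p_prop > 0.5).astype(float)
    return bce(p_seg, pseudo_prop) + bce(p_prop, pseudo_seg)
```

When both pathways agree confidently on an unlabeled frame, the cross-supervision loss is near zero; disagreement produces a gradient that pushes the two predictions toward each other, which is the mechanism that lets the 50%-annotation setting in the abstract exploit the un-annotated frames.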
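The mDice and mIoU figures quoted above are frame-averaged overlap metrics. A minimal sketch of how they are typically computed for binary polyp masks (the smoothing term `eps` is a common convention for empty masks, not a detail taken from the paper):

```python
import numpy as np

def dice_iou(pred, gt, eps=1e-8):
    """Dice and IoU for one pair of binary masks (1 = polyp, 0 = background)."""
    pred = np.asarray(pred).astype(bool)
    gt = np.asarray(gt).astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    dice = (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)
    iou = (inter + eps) / (union + eps)
    return float(dice), float(iou)

def mean_scores(preds, gts):
    """mDice / mIoU: per-frame scores averaged over a video clip."""
    scores = [dice_iou(p, g) for p, g in zip(preds, gts)]
    dices, ious = zip(*scores)
    return sum(dices) / len(dices), sum(ious) / len(ious)
```

For example, a prediction that covers the ground-truth polyp plus one spurious pixel out of two yields Dice 2/3 and IoU 1/2, which illustrates why IoU penalizes the same error more harshly than Dice.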
Source journal
Biomedical Signal Processing and Control (Engineering, Biomedical)
CiteScore: 9.80
Self-citation rate: 13.70%
Annual publications: 822
Review time: 4 months
Journal description: Biomedical Signal Processing and Control aims to provide a cross-disciplinary international forum for the interchange of information on research in the measurement and analysis of signals and images in clinical medicine and the biological sciences. Emphasis is placed on contributions dealing with the practical, applications-led research on the use of methods and devices in clinical diagnosis, patient monitoring and management. Biomedical Signal Processing and Control reflects the main areas in which these methods are being used and developed at the interface of both engineering and clinical science. The scope of the journal is defined to include relevant review papers, technical notes, short communications and letters. Tutorial papers and special issues will also be published.