Jialu Li, Lei Zhu, Zhaohu Xing, Baoliang Zhao, Ying Hu, Faqin Lv, Qiong Wang
{"title":"用于超声波视频对象分割的级联内-外夹式重构器","authors":"Jialu Li, Lei Zhu, Zhaohu Xing, Baoliang Zhao, Ying Hu, Faqin Lv, Qiong Wang","doi":"10.1109/JBHI.2024.3464732","DOIUrl":null,"url":null,"abstract":"<p><p>Computer-aided ultrasound (US) imaging is an important prerequisite for early clinical diagnosis and treatment. Due to the harsh ultrasound (US) image quality and the blurry tumor area, recent memory-based video object segmentation models (VOS) achieve frame-level segmentation by performing intensive similarity matching among the past frames which could inevitably result in computational redundancy. Furthermore, the current attention mechanism utilized in recent models only allocates the same attention level among whole spatial-temporal memory features without making distinctions, which may result in accuracy degradation. In this paper, we first build a larger annotated benchmark dataset for breast lesion segmentation in ultrasound videos, then we propose a lightweight clip-level VOS framework for achieving higher segmentation accuracy while maintaining the speed. The Inner-Outer Clip Retformer is proposed to extract spatialtemporal tumor features in parallel. Specifically, the proposed Outer Clip Retformer extracts the tumor movement feature from past video clips to locate the current clip tumor position, while the Inner Clip Retformer detailedly extracts current tumor features that can produce more accurate segmentation results. Then a Clip Contrastive loss function is further proposed to align the extracted tumor features along both the spatial-temporal dimensions to improve the segmentation accuracy. In addition, the Global Retentive Memory is proposed to maintain the complementary tumor features with lower computing resources which can generate coherent temporal movement features. In this way, our model can significantly improve the spatial-temporal perception ability without increasing a large number of parameters, achieving more accurate segmentation results while maintaining a faster segmentation speed. Finally, we conduct extensive experiments to evaluate our proposed model on several video object segmentation datasets, the results show that our framework outperforms state-of-theart segmentation methods.</p>","PeriodicalId":13073,"journal":{"name":"IEEE Journal of Biomedical and Health Informatics","volume":"PP ","pages":""},"PeriodicalIF":6.7000,"publicationDate":"2024-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Cascaded Inner-Outer Clip Retformer for Ultrasound Video Object Segmentation.\",\"authors\":\"Jialu Li, Lei Zhu, Zhaohu Xing, Baoliang Zhao, Ying Hu, Faqin Lv, Qiong Wang\",\"doi\":\"10.1109/JBHI.2024.3464732\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Computer-aided ultrasound (US) imaging is an important prerequisite for early clinical diagnosis and treatment. Due to the harsh ultrasound (US) image quality and the blurry tumor area, recent memory-based video object segmentation models (VOS) achieve frame-level segmentation by performing intensive similarity matching among the past frames which could inevitably result in computational redundancy. Furthermore, the current attention mechanism utilized in recent models only allocates the same attention level among whole spatial-temporal memory features without making distinctions, which may result in accuracy degradation. In this paper, we first build a larger annotated benchmark dataset for breast lesion segmentation in ultrasound videos, then we propose a lightweight clip-level VOS framework for achieving higher segmentation accuracy while maintaining the speed. The Inner-Outer Clip Retformer is proposed to extract spatialtemporal tumor features in parallel. Specifically, the proposed Outer Clip Retformer extracts the tumor movement feature from past video clips to locate the current clip tumor position, while the Inner Clip Retformer detailedly extracts current tumor features that can produce more accurate segmentation results. Then a Clip Contrastive loss function is further proposed to align the extracted tumor features along both the spatial-temporal dimensions to improve the segmentation accuracy. In addition, the Global Retentive Memory is proposed to maintain the complementary tumor features with lower computing resources which can generate coherent temporal movement features. In this way, our model can significantly improve the spatial-temporal perception ability without increasing a large number of parameters, achieving more accurate segmentation results while maintaining a faster segmentation speed. Finally, we conduct extensive experiments to evaluate our proposed model on several video object segmentation datasets, the results show that our framework outperforms state-of-theart segmentation methods.</p>\",\"PeriodicalId\":13073,\"journal\":{\"name\":\"IEEE Journal of Biomedical and Health Informatics\",\"volume\":\"PP \",\"pages\":\"\"},\"PeriodicalIF\":6.7000,\"publicationDate\":\"2024-10-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Journal of Biomedical and Health Informatics\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://doi.org/10.1109/JBHI.2024.3464732\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal of Biomedical and Health Informatics","FirstCategoryId":"5","ListUrlMain":"https://doi.org/10.1109/JBHI.2024.3464732","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
摘要
计算机辅助超声(US)成像是早期临床诊断和治疗的重要前提。由于超声波(US)图像质量苛刻且肿瘤区域模糊,近期基于内存的视频对象分割模型(VOS)通过在过去的帧之间进行密集的相似性匹配来实现帧级分割,这不可避免地会造成计算冗余。此外,目前最新模型所采用的注意力机制只是在整个时空记忆特征之间分配相同的注意力级别,而不进行区分,这可能会导致精度下降。在本文中,我们首先为超声视频中的乳腺病变分割建立了一个更大的标注基准数据集,然后提出了一个轻量级片段级 VOS 框架,以在保持速度的同时获得更高的分割精度。我们提出了内-外片段重构器(Inner-Outer Clip Retformer)来并行提取肿瘤的时空特征。具体来说,外片段重构器从过去的视频片段中提取肿瘤运动特征,从而定位当前片段的肿瘤位置,而内片段重构器则详细提取当前肿瘤特征,从而得出更准确的分割结果。然后,进一步提出 Clip Contrastive 损失函数,使提取的肿瘤特征在空间和时间维度上保持一致,从而提高分割精度。此外,我们还提出了全局保留记忆(Global Retentive Memory)技术,以较低的计算资源来保留互补的肿瘤特征,从而生成连贯的时间运动特征。这样,我们的模型就能在不增加大量参数的情况下显著提高时空感知能力,在保持较快的分割速度的同时获得更精确的分割结果。最后,我们在多个视频对象分割数据集上进行了大量实验,以评估我们提出的模型,结果表明我们的框架优于现有的分割方法。
Cascaded Inner-Outer Clip Retformer for Ultrasound Video Object Segmentation.
Computer-aided ultrasound (US) imaging is an important prerequisite for early clinical diagnosis and treatment. Due to the harsh ultrasound (US) image quality and the blurry tumor area, recent memory-based video object segmentation models (VOS) achieve frame-level segmentation by performing intensive similarity matching among the past frames which could inevitably result in computational redundancy. Furthermore, the current attention mechanism utilized in recent models only allocates the same attention level among whole spatial-temporal memory features without making distinctions, which may result in accuracy degradation. In this paper, we first build a larger annotated benchmark dataset for breast lesion segmentation in ultrasound videos, then we propose a lightweight clip-level VOS framework for achieving higher segmentation accuracy while maintaining the speed. The Inner-Outer Clip Retformer is proposed to extract spatialtemporal tumor features in parallel. Specifically, the proposed Outer Clip Retformer extracts the tumor movement feature from past video clips to locate the current clip tumor position, while the Inner Clip Retformer detailedly extracts current tumor features that can produce more accurate segmentation results. Then a Clip Contrastive loss function is further proposed to align the extracted tumor features along both the spatial-temporal dimensions to improve the segmentation accuracy. In addition, the Global Retentive Memory is proposed to maintain the complementary tumor features with lower computing resources which can generate coherent temporal movement features. In this way, our model can significantly improve the spatial-temporal perception ability without increasing a large number of parameters, achieving more accurate segmentation results while maintaining a faster segmentation speed. Finally, we conduct extensive experiments to evaluate our proposed model on several video object segmentation datasets, the results show that our framework outperforms state-of-theart segmentation methods.
期刊介绍:
IEEE Journal of Biomedical and Health Informatics publishes original papers presenting recent advances where information and communication technologies intersect with health, healthcare, life sciences, and biomedicine. Topics include acquisition, transmission, storage, retrieval, management, and analysis of biomedical and health information. The journal covers applications of information technologies in healthcare, patient monitoring, preventive care, early disease diagnosis, therapy discovery, and personalized treatment protocols. It explores electronic medical and health records, clinical information systems, decision support systems, medical and biological imaging informatics, wearable systems, body area/sensor networks, and more. Integration-related topics like interoperability, evidence-based medicine, and secure patient data are also addressed.