VSmTrans: A hybrid paradigm integrating self-attention and convolution for 3D medical image segmentation

IF 11.8 | Zone 1 (Medicine) | JCR Q1: Computer Science, Artificial Intelligence

Medical Image Analysis | Pub Date: 2024-08-24 | DOI: 10.1016/j.media.2024.103295

Tiange Liu, Qingze Bai, Drew A. Torigian, Yubing Tong, Jayaram K. Udupa

Medical Image Analysis, volume 98, Article 103295 (journal article). Publisher page: https://www.sciencedirect.com/science/article/pii/S1361841524002202

Citation count: 0

Abstract

Purpose

Vision Transformers have recently achieved performance competitive with CNNs, owing to their strong capability for learning global representations. However, two major challenges arise when applying them to 3D image segmentation: (i) because 3D medical images are large, comprehensive global information is hard to capture at an acceptable computational cost; (ii) the insufficient local inductive bias of Transformers impairs the segmentation of fine details such as ambiguous and subtly defined boundaries. Applying the Vision Transformer mechanism to medical image segmentation therefore requires that both challenges be adequately addressed.
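The cost behind challenge (i) can be made concrete with a little arithmetic. The sketch below counts query-key pairs for global self-attention and for attention restricted to non-overlapping windows. The token grid and window sizes are hypothetical values chosen for illustration and are not taken from the paper; the function names are likewise illustrative.

```python
from math import prod


def attention_pairs_global(grid_shape):
    """Query-key pairs when every token attends to every other token."""
    n_tokens = prod(grid_shape)
    return n_tokens * n_tokens


def attention_pairs_windowed(grid_shape, window_shape):
    """Query-key pairs when attention is limited to non-overlapping windows."""
    if any(g % w for g, w in zip(grid_shape, window_shape)):
        raise ValueError("window must tile the token grid exactly")
    n_windows = prod(g // w for g, w in zip(grid_shape, window_shape))
    tokens_per_window = prod(window_shape)
    return n_windows * tokens_per_window ** 2


# Hypothetical sizes, NOT from the paper: a 48x48x48 token grid, 6x6x6 windows.
grid = (48, 48, 48)
window = (6, 6, 6)

global_pairs = attention_pairs_global(grid)
local_pairs = attention_pairs_windowed(grid, window)
print(f"global:   {global_pairs:,} pairs")
print(f"windowed: {local_pairs:,} pairs")
print(f"ratio:    {global_pairs // local_pairs}x")
```

Under these assumed sizes, global attention needs several hundred times more pairwise comparisons than windowed attention. The same function also shows that windows of different shapes but equal token count, for example (6, 6, 6) and (3, 6, 12), cost the same number of pairs.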

Methods

We propose a hybrid paradigm, the Variable-Shape Mixed Transformer (VSmTrans), that integrates self-attention and convolution, combining the freedom of the self-attention mechanism to learn complex relationships with the local prior knowledge supplied by convolution. Specifically, we design a Variable-Shape self-attention mechanism that rapidly expands the receptive field without extra computational cost and achieves a good trade-off between global awareness and local detail. In addition, a parallel convolution branch introduces a strong local inductive bias that improves the ability to capture fine details. A pair of learnable parameters automatically adjusts the relative importance of the two paradigms. Extensive experiments were conducted on two public medical image datasets of different modalities: the AMOS CT dataset and the BraTS2021 MRI dataset.
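The idea of running self-attention and convolution in parallel and weighting them with two parameters can be sketched on a toy 1D signal. This is not the authors' implementation: the abstract does not say how the two parameters enter the computation, so the weighted sum `alpha * attention + beta * convolution` is an assumption, as are all function names. The Variable-Shape window mechanism is not modeled here, and in a real network `alpha` and `beta` would be trained rather than set by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_1d(x):
    """Toy attention over scalar tokens: each output mixes ALL inputs (global)."""
    out = []
    for query in x:
        weights = softmax([query * key for key in x])
        out.append(sum(w * value for w, value in zip(weights, x)))
    return out


def conv_1d(x, kernel):
    """Same-size 1D convolution with zero padding: each output sees only neighbors.

    Assumes an odd-length kernel.
    """
    radius = len(kernel) // 2
    padded = [0.0] * radius + list(x) + [0.0] * radius
    return [
        sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
        for i in range(len(x))
    ]


def hybrid_block(x, kernel, alpha, beta):
    """Assumed fusion rule: weighted sum of the two parallel branches."""
    attn = self_attention_1d(x)
    conv = conv_1d(x, kernel)
    return [alpha * a + beta * c for a, c in zip(attn, conv)]


signal = [1.0, 2.0, 3.0, 4.0]
smoothing = [0.25, 0.5, 0.25]
print(hybrid_block(signal, smoothing, alpha=0.6, beta=0.4))
```

Setting `alpha=1, beta=0` recovers the pure attention branch and `alpha=0, beta=1` the pure convolution branch, which is one way to see how two scalars could shift the balance between global context and local detail.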

Results

Our method achieves the best average Dice scores, 88.3% and 89.7%, on these datasets, outperforming the previous state-of-the-art Swin Transformer-based and CNN-based architectures. A series of ablation experiments was also conducted to verify the efficiency of the proposed hybrid mechanism and its components, and to explore the effect of the key parameters in VSmTrans.
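For readers unfamiliar with the metric, the Dice score for one label is twice the overlap between predicted and reference voxels divided by the sum of their sizes. The sketch below computes it on flattened label lists and averages over foreground labels. The averaging scheme and the handling of labels absent from both masks are assumptions here; the abstract does not state how the reported averages were formed.

```python
def dice_score(pred, target, label):
    """Dice = 2 * |P intersect T| / (|P| + |T|) for a single label."""
    pred_count = sum(1 for p in pred if p == label)
    target_count = sum(1 for t in target if t == label)
    overlap = sum(1 for p, t in zip(pred, target) if p == label and t == label)
    if pred_count + target_count == 0:
        return 1.0  # assumed convention: label absent from both counts as perfect
    return 2.0 * overlap / (pred_count + target_count)


def mean_dice(pred, target, labels):
    """Unweighted mean of per-label Dice scores (assumed averaging scheme)."""
    return sum(dice_score(pred, target, label) for label in labels) / len(labels)


# Tiny made-up example: six voxels, background 0 and two foreground labels.
prediction = [0, 1, 1, 2, 2, 2]
reference = [0, 1, 2, 2, 2, 0]
print(f"mean Dice: {mean_dice(prediction, reference, labels=[1, 2]):.3f}")
```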

Conclusions

The proposed hybrid Transformer-based backbone network for 3D medical image segmentation tightly integrates self-attention and convolution to exploit the advantages of both paradigms. The experimental results demonstrate that our method outperforms other state-of-the-art methods, and the hybrid paradigm appears to be the most appropriate one for the medical image segmentation field. The ablation experiments also demonstrate that the proposed hybrid mechanism effectively balances large receptive fields with local inductive biases, yielding highly accurate segmentation results, especially in capturing details. Our code is available at https://github.com/qingze-bai/VSmTrans.


Source journal

Medical Image Analysis (Engineering & Technology; Engineering: Biomedical)

CiteScore: 22.10

Self-citation rate: 6.40%

Articles published: 309

Review time: 6.6 months

About the journal: Medical Image Analysis serves as a platform for sharing new research findings in medical and biological image analysis, with a focus on applications of computer vision, virtual reality, and robotics to biomedical imaging challenges. The journal prioritizes the publication of high-quality, original papers contributing to the fundamental science of processing, analyzing, and utilizing medical and biological images. It welcomes approaches utilizing biomedical image datasets across all spatial scales, from molecular/cellular imaging to tissue/organ imaging.