Multimodal Sensitive Adaptive Transformer for 3D medical image segmentation

Impact Factor 4.2, CAS Zone 3 (Computer Science), JCR Q2 (Computer Science, Artificial Intelligence)

Image and Vision Computing, Pub Date: 2025-06-16, DOI: 10.1016/j.imavis.2025.105606

Zhibing Wang, Wenmin Wang, Nannan Li, Qi Chen, Yifan Zhang, Meng Xiao, Haomei Jia, Shenyong Zhang

Image and Vision Computing, volume 161, Article 105606. Publisher page: https://www.sciencedirect.com/science/article/pii/S0262885625001945

Citations: 0

Abstract

Three-dimensional medical image segmentation remains a significant challenge, and segmenting multiple organs and lesions in MRI images is particularly demanding. This paper introduces an approach built on Multimodal Sensitive Adaptive Attention (MSAA). We refer to the resulting architecture as the Multimodal Sensitive Adaptive Transformer Network (MSAT). It incorporates downsampling and MSAA into the encoding phase, and integrates skip connections from different layers, MSAA outputs, and upsampled feature outputs into the decoding phase. The MSAT consists of two primary components. The first is a network architecture designed to extract a richer set of high-dimensional features by integrating skip connections from different layers, outputs from the MSAA, and the results of the preceding upsampling layer. The second is the MSAA block, which combines two attention mechanisms: Local Sensitive Adaptive Attention (LSAA) and Spatial Sensitive Adaptive Attention (SSAA). These mechanisms work together to blend high- and low-dimensional features, enriching the contextual information captured by the model. Experiments on several datasets, including Synapse, BTCV, ACDC, and BraTS 2021, show that the MSAT outperforms existing methods on 3D multi-organ, cardiac, and brain tumor segmentation tasks.
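
To make the decoding phase described in the abstract concrete, the following is a minimal NumPy sketch of one decoder stage that merges the three inputs the abstract names: a skip connection, an MSAA output, and the upsampled output of the preceding layer. The abstract does not say how these are combined, so channel-wise concatenation and nearest-neighbour upsampling are assumptions here, as are the function names and tensor sizes. The internals of MSAA, LSAA, and SSAA are not described in the abstract and are not modelled.

```python
import numpy as np


def upsample_nearest(x, factor=2):
    """Nearest-neighbour upsampling of a (C, D, H, W) feature map.

    Assumption: the abstract does not specify the upsampling operator.
    """
    for axis in (1, 2, 3):
        x = np.repeat(x, factor, axis=axis)
    return x


def fuse_decoder_inputs(skip, attn_out, prev_decoder):
    """Combine the three decoder inputs named in the abstract.

    Assumption: channel-wise concatenation stands in for whatever
    fusion the paper actually uses.
    """
    up = upsample_nearest(prev_decoder, factor=2)
    if not (skip.shape[1:] == attn_out.shape[1:] == up.shape[1:]):
        raise ValueError("spatial sizes must match before fusion")
    return np.concatenate([skip, attn_out, up], axis=0)


# Hypothetical sizes, chosen only for illustration.
skip = np.zeros((16, 8, 8, 8))          # encoder skip connection
attn_out = np.zeros((16, 8, 8, 8))      # stand-in for an MSAA output
prev_decoder = np.zeros((32, 4, 4, 4))  # preceding (coarser) decoder layer

fused = fuse_decoder_inputs(skip, attn_out, prev_decoder)
print(fused.shape)  # (64, 8, 8, 8)
```

In a real network the fused tensor would then pass through learned layers; the sketch only shows how the three sources can be brought to a common resolution and merged.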


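Source journal

Image and Vision Computing (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 8.50

Self-citation rate: 8.50%

Articles published: 143

Review time: 7.8 months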

About the journal: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.