{"title":"A Hybrid Medical Image Semantic Segmentation Network Based on Novel Mamba and Transformer","authors":"Jianting Shi, Huanhuan Liu, Zhijun Li","doi":"10.1049/ipr2.70205","DOIUrl":null,"url":null,"abstract":"<p>Recently, deep learning has greatly advanced medical image segmentation. Convolutional neural networks (CNNs) excel in capturing local image features, whereas the Vision Transformer (ViT) adeptly models long-range dependencies through multi-head self-attention mechanisms. Despite their strengths, both CNN and ViT face challenges in efficiently processing long-range dependencies in medical images and often require substantial computational resources. To address this, we propose a novel hybrid model combining Mamba and Transformer architectures. Our model integrates ViT's self-attention modules within a pure-vision Mamba U-shaped encoder, capturing both global and local information through nested Transformer and Mamba modules. Additionally, a multi-scale feed-forward neural network is incorporated within the Mamba blocks to enhance feature diversity by capturing fine-grained local details. Finally, a channel-adaptive feature (CAF) fusion module is introduced at the original skip connections to mitigate feature loss during information fusion and to improve segmentation accuracy in boundary regions. Quantitative and qualitative experiments were conducted on two public datasets: breast ultrasound image (BUSI) and ClinicDB. The Dice score, Intersection over Union (IoU) score, recall score, <i>F</i>1 score and 95th percentile Hausdorff distance (HD95) of the proposed model on the BUSI dataset were 0.7918, 0.7016, 0.8508, 0.7919 and 12.04 mm, respectively. On ClinicDB, these metrics reach 0.9239, 0.8671, 0.9278, 0.9239 and 5.49 mm, respectively. 
The proposed model outperforms existing state-of-the-art CNN-, Transformer- and Mamba-based methods in segmentation accuracy, as demonstrated by the experimental results.</p>","PeriodicalId":56303,"journal":{"name":"IET Image Processing","volume":"19 1","pages":""},"PeriodicalIF":2.2000,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/ipr2.70205","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IET Image Processing","FirstCategoryId":"94","ListUrlMain":"https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/ipr2.70205","RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Recently, deep learning has greatly advanced medical image segmentation. Convolutional neural networks (CNNs) excel at capturing local image features, whereas the Vision Transformer (ViT) adeptly models long-range dependencies through multi-head self-attention mechanisms. Despite their strengths, both CNNs and ViTs face challenges in efficiently processing long-range dependencies in medical images and often require substantial computational resources. To address this, we propose a novel hybrid model combining Mamba and Transformer architectures. Our model integrates ViT's self-attention modules within a pure-vision Mamba U-shaped encoder, capturing both global and local information through nested Transformer and Mamba modules. Additionally, a multi-scale feed-forward neural network is incorporated within the Mamba blocks to enhance feature diversity by capturing fine-grained local details. Finally, a channel-adaptive feature (CAF) fusion module is introduced at the original skip connections to mitigate feature loss during information fusion and to improve segmentation accuracy in boundary regions. Quantitative and qualitative experiments were conducted on two public datasets: breast ultrasound image (BUSI) and ClinicDB. The Dice score, Intersection over Union (IoU) score, recall score, F1 score and 95th percentile Hausdorff distance (HD95) of the proposed model on the BUSI dataset were 0.7918, 0.7016, 0.8508, 0.7919 and 12.04 mm, respectively. On ClinicDB, these metrics reach 0.9239, 0.8671, 0.9278, 0.9239 and 5.49 mm, respectively. The proposed model outperforms existing state-of-the-art CNN-, Transformer- and Mamba-based methods in segmentation accuracy, as demonstrated by the experimental results.
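For reference, the overlap metrics reported above (Dice and IoU) can be computed from binary segmentation masks as in the sketch below. This is an illustrative example only, not the authors' evaluation code; the function names `dice_score` and `iou_score` are chosen here. Note that for binary masks the Dice coefficient coincides with the F1 score, which is consistent with the near-identical Dice and F1 values reported; HD95 additionally requires boundary-distance computation and is omitted from this sketch.

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    # Dice = 2|P ∩ G| / (|P| + |G|), computed on boolean masks.
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))

def iou_score(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    # IoU = |P ∩ G| / |P ∪ G|; related to Dice by Dice = 2*IoU / (1 + IoU).
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float((inter + eps) / (union + eps))

# Toy 3x3 masks standing in for a predicted and a ground-truth lesion region.
pred = np.array([[1, 1, 0],
                 [0, 1, 0],
                 [0, 0, 0]], dtype=bool)
gt   = np.array([[1, 0, 0],
                 [0, 1, 1],
                 [0, 0, 0]], dtype=bool)

print(round(dice_score(pred, gt), 4))  # 0.6667
print(round(iou_score(pred, gt), 4))   # 0.5
```

The small epsilon keeps the scores well-defined when both masks are empty, a common edge case in lesion segmentation benchmarks.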
Journal introduction:
The IET Image Processing journal encompasses research areas related to the generation, processing and communication of visual information. The journal focuses on the latest research results in image and video processing, including image generation and display, enhancement and restoration, segmentation, colour and texture analysis, coding and communication, implementations and architectures, as well as innovative applications.
Principal topics include:
Generation and Display - Imaging sensors and acquisition systems, illumination, sampling and scanning, quantization, colour reproduction, image rendering, display and printing systems, evaluation of image quality.
Processing and Analysis - Image enhancement, restoration, segmentation, registration, multispectral, colour and texture processing, multiresolution processing and wavelets, morphological operations, stereoscopic and 3-D processing, motion detection and estimation, video and image sequence processing.
Implementations and Architectures - Image and video processing hardware and software, design and construction, architectures and software, neural, adaptive, and fuzzy processing.
Coding and Transmission - Image and video compression and coding, compression standards, noise modelling, visual information networks, streamed video.
Retrieval and Multimedia - Storage of images and video, database design, image retrieval, video annotation and editing, mixed media incorporating visual information, multimedia systems and applications, image and video watermarking, steganography.
Applications - Innovative application of image and video processing technologies to any field, including life sciences, earth sciences, astronomy, document processing and security.
Current Special Issue Call for Papers:
Evolutionary Computation for Image Processing - https://digital-library.theiet.org/files/IET_IPR_CFP_EC.pdf
AI-Powered 3D Vision - https://digital-library.theiet.org/files/IET_IPR_CFP_AIPV.pdf
Multidisciplinary advancement of Imaging Technologies: From Medical Diagnostics and Genomics to Cognitive Machine Vision, and Artificial Intelligence - https://digital-library.theiet.org/files/IET_IPR_CFP_IST.pdf
Deep Learning for 3D Reconstruction - https://digital-library.theiet.org/files/IET_IPR_CFP_DLR.pdf