{"title":"MaS-TransUNet: A Multiattention Swin Transformer U-Net for Medical Image Segmentation","authors":"Ashwini Kumar Upadhyay;Ashish Kumar Bhandari","doi":"10.1109/TRPMS.2024.3477528","DOIUrl":null,"url":null,"abstract":"U-shaped encoder-decoder models have excelled in automatic medical image segmentation due to their hierarchical feature learning capabilities, robustness, and upgradability. Purely CNN-based models are excellent at extracting local details but struggle with long-range dependencies, whereas transformer-based models excel in global context modeling but have higher data and computational requirements. Self-attention-based transformers and other attention mechanisms have been shown to enhance segmentation accuracy in the encoder-decoder framework. Drawing from these challenges and opportunities, we propose a novel multiattention Swin transformer U-net (MaS-TransUNet) model, incorporating self-attention, edge attention, channel attention, and feedback attention. MaS-TransUNet leverages the strengths of both CNNs and transformers within a U-shaped encoder-decoder framework. For self-attention, we developed modules using Swin Transformer blocks, offering hierarchical feature representations. We designed specialized modules, including an edge attention module (EAM) to guide the network with edge information, a feedback attention module (FAM) to utilize previous epoch segmentation masks for refining subsequent predictions, and a channel attention module (CAM) to focus on relevant feature channels. We also introduced advanced data augmentation, regularizations, and an optimal training scheme for enhanced training. Comprehensive experiments across five diverse medical image segmentation datasets demonstrate that MaS-TransUNet significantly outperforms existing state-of-the-art methods while maintaining computational efficiency. It achieves the highest-Dice scores of 0.903, 0.841, 0.908, 0.906, and 0.906 on the Cancer genome atlas low-grade glioma Brain MRI, COVID-19 Lung CT, data science bowl-2018, Kvasir-SEG, and international skin imaging collaboration-2018 datasets, respectively. These results highlight the model’s robustness and versatility, consistently delivering exceptional performance without modality-specific adaptations.","PeriodicalId":46807,"journal":{"name":"IEEE Transactions on Radiation and Plasma Medical Sciences","volume":"9 5","pages":"613-626"},"PeriodicalIF":4.6000,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Radiation and Plasma Medical Sciences","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10713266/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
Abstract
U-shaped encoder-decoder models have excelled in automatic medical image segmentation due to their hierarchical feature learning capabilities, robustness, and upgradability. Purely CNN-based models are excellent at extracting local details but struggle with long-range dependencies, whereas transformer-based models excel at global context modeling but have higher data and computational requirements. Self-attention-based transformers and other attention mechanisms have been shown to enhance segmentation accuracy within the encoder-decoder framework. Motivated by these challenges and opportunities, we propose a novel multiattention Swin Transformer U-Net (MaS-TransUNet) model incorporating self-attention, edge attention, channel attention, and feedback attention. MaS-TransUNet leverages the strengths of both CNNs and transformers within a U-shaped encoder-decoder framework. For self-attention, we develop modules built on Swin Transformer blocks, which provide hierarchical feature representations. We design specialized modules, including an edge attention module (EAM) to guide the network with edge information, a feedback attention module (FAM) that uses the previous epoch's segmentation masks to refine subsequent predictions, and a channel attention module (CAM) to focus on relevant feature channels. We also introduce advanced data augmentation, regularization techniques, and an optimized training scheme to improve training. Comprehensive experiments across five diverse medical image segmentation datasets demonstrate that MaS-TransUNet significantly outperforms existing state-of-the-art methods while maintaining computational efficiency. It achieves the highest Dice scores of 0.903, 0.841, 0.908, 0.906, and 0.906 on The Cancer Genome Atlas (TCGA) low-grade glioma brain MRI, COVID-19 lung CT, Data Science Bowl 2018, Kvasir-SEG, and International Skin Imaging Collaboration (ISIC) 2018 datasets, respectively. These results highlight the model's robustness and versatility, consistently delivering exceptional performance without modality-specific adaptations.
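The abstract mentions a channel attention module (CAM) that emphasizes relevant feature channels but does not specify its design. As an illustration only, the sketch below shows a common squeeze-and-excitation-style channel attention block in PyTorch; the class name, reduction ratio, and placement are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch, not the paper's CAM: assumes a standard
# squeeze-and-excitation style channel attention over 2-D feature maps.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Re-weights feature channels using global context (SE-style)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Squeeze: one scalar per channel via global average pooling.
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Excitation: learn per-channel weights in (0, 1).
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights  # scale each channel by its learned weight


if __name__ == "__main__":
    feats = torch.randn(2, 64, 32, 32)   # (batch, channels, H, W) decoder features
    cam = ChannelAttention(channels=64)
    print(cam(feats).shape)              # torch.Size([2, 64, 32, 32])
```

In a U-shaped decoder, such a block would typically be applied to fused skip-connection features before the next upsampling stage, so that uninformative channels are suppressed at low cost; the exact integration point in MaS-TransUNet is described in the full paper.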