In recent years, deep convolutional neural networks (CNNs) have achieved great success in medical imaging. However, it is difficult to obtain accurate pathological information for clinical diagnosis and treatment from single-modality medical images alone. This study aims to provide an efficient multimodality whole heart segmentation method for the diagnosis of coronary heart disease.
We propose SFAM-TransUnet, a novel deep learning framework for multimodality whole heart segmentation that combines CNNs and transformers. The method integrates CNNs and vision transformers (ViTs) into a unified fusion framework. Specifically, a shallow feature fusion module is designed to connect MRI and CT images, providing a powerful and efficient multimodality fusion backbone for semantic segmentation. Furthermore, we propose a fusion ViT (FViT) module comprising self-attention (SA) and adaptive mutual boost attention (Ada-MBA) to enhance contextual information within and across modalities. The Ada-MBA module directs attention to semantically informative regions by computing both SA and cross-attention, improving the ability to capture context across the two modalities. Extensive experiments are conducted on the clinical Multi-Modality Whole Heart Segmentation datasets.
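For illustration only, the pairing of self-attention with CT-to-MRI cross-attention described above can be sketched as a minimal PyTorch module; the class name, the learnable mixing weight alpha, and the token shapes are assumptions for this sketch, not the authors' implementation.

import torch
import torch.nn as nn

class AdaMBASketch(nn.Module):
    # Hypothetical sketch of the Ada-MBA idea: blend intra-modality
    # self-attention with inter-modality cross-attention (assumed design).
    def __init__(self, dim, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.alpha = nn.Parameter(torch.tensor(0.5))  # adaptive mixing weight (assumed)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ct_tokens, mri_tokens):
        # Intra-modality context: self-attention over CT tokens.
        sa, _ = self.self_attn(ct_tokens, ct_tokens, ct_tokens)
        # Inter-modality context: CT queries attend to MRI keys/values.
        ca, _ = self.cross_attn(ct_tokens, mri_tokens, mri_tokens)
        # Adaptive blend of the two attention streams with a residual path.
        return self.norm(ct_tokens + self.alpha * sa + (1 - self.alpha) * ca)

# Usage: tokens of shape (batch, num_patches, dim) from each modality encoder.
fused = AdaMBASketch(dim=256)(torch.randn(2, 196, 256), torch.randn(2, 196, 256))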
We improved the whole heart segmentation Dice similarity coefficients (DSCs) to 0.902 (AA), 0.920 (LV-blood), 0.863 (LA-blood), and 0.837 (LV-myo); the Hausdorff distances (HDs) to 9.886 (AA), 9.947 (LV-blood), 11.911 (LA-blood), and 13.599 (LV-myo); the PSNR values to 33.577 (AA), 30.091 (LV-blood), 32.055 (LA-blood), and 29.837 (LV-myo); and the SSIM values to 0.901 (AA), 0.818 (LV-blood), 0.765 (LA-blood), and 0.743 (LV-myo). These results demonstrate that SFAM-TransUnet outperforms various alternative methods.
We propose SFAM-TransUnet, an efficient framework tailored for whole heart segmentation that combines CNNs and transformers into a powerful multimodality fusion network, improving the performance of whole heart semantic segmentation. The results demonstrate the efficacy of SFAM-TransUnet in integrating relevant information across modalities in multimodal tasks.


