{"title":"CFFormer:一种多源遥感图像语义分割的交叉融合变压器框架","authors":"Jinqi Zhao;Ming Zhang;Zhonghuai Zhou;Zixuan Wang;Fengkai Lang;Hongtao Shi;Nanshan Zheng","doi":"10.1109/TGRS.2024.3507274","DOIUrl":null,"url":null,"abstract":"Multisource remote sensing images (RSIs) can capture the complementary information of ground objects for use in semantic segmentation. However, there can be inconsistency and interference noise among the multimodal data from different sensors. Therefore, it is a challenge to effectively reduce the differences and noise between the different modalities and fully utilize their complementary features. In this article, we propose a universal cross-fusion transformer framework (CFFormer) for the semantic segmentation of multisource RSIs, adopting a parallel dual-stream structure to extract features separately from the different modalities. We introduce a feature correction module (FCM) that corrects the features of the current modality by combining features from the other modalities in both the spatial and channel dimensions. In the feature fusion module (FFM), we employ a multihead cross-attention mechanism to interact globally and fuse features from the different modalities, enabling the comprehensive utilization of the complementary information in multisource RSIs. Finally, comparative experiments demonstrate that the proposed CFFormer framework not only achieves state-of-the-art (SOTA) accuracy but also exhibits outstanding robustness when compared to the current advanced networks for semantic segmentation of multisource RSIs. Specifically, CFFormer achieves a mean intersection over union (mIoU) of 58% and an overall accuracy (OA) of 85.35% on the WHU-OPT-SAR dataset, outperforming the second-ranked network by 4.71% and 1.74%, respectively. On the Vaihingen and Potsdam datasets, CFFormer also achieves the best results, with mIoU and OA values of 84.31%/91.88% and 88.62%/92.64%, respectively. The source code is available at \n<uri>https://github.com/masurq/CFFormer</uri>\n.","PeriodicalId":13213,"journal":{"name":"IEEE Transactions on Geoscience and Remote Sensing","volume":"63 ","pages":"1-17"},"PeriodicalIF":8.6000,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CFFormer: A Cross-Fusion Transformer Framework for the Semantic Segmentation of Multisource Remote Sensing Images\",\"authors\":\"Jinqi Zhao;Ming Zhang;Zhonghuai Zhou;Zixuan Wang;Fengkai Lang;Hongtao Shi;Nanshan Zheng\",\"doi\":\"10.1109/TGRS.2024.3507274\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multisource remote sensing images (RSIs) can capture the complementary information of ground objects for use in semantic segmentation. However, there can be inconsistency and interference noise among the multimodal data from different sensors. Therefore, it is a challenge to effectively reduce the differences and noise between the different modalities and fully utilize their complementary features. In this article, we propose a universal cross-fusion transformer framework (CFFormer) for the semantic segmentation of multisource RSIs, adopting a parallel dual-stream structure to extract features separately from the different modalities. We introduce a feature correction module (FCM) that corrects the features of the current modality by combining features from the other modalities in both the spatial and channel dimensions. 
In the feature fusion module (FFM), we employ a multihead cross-attention mechanism to interact globally and fuse features from the different modalities, enabling the comprehensive utilization of the complementary information in multisource RSIs. Finally, comparative experiments demonstrate that the proposed CFFormer framework not only achieves state-of-the-art (SOTA) accuracy but also exhibits outstanding robustness when compared to the current advanced networks for semantic segmentation of multisource RSIs. Specifically, CFFormer achieves a mean intersection over union (mIoU) of 58% and an overall accuracy (OA) of 85.35% on the WHU-OPT-SAR dataset, outperforming the second-ranked network by 4.71% and 1.74%, respectively. On the Vaihingen and Potsdam datasets, CFFormer also achieves the best results, with mIoU and OA values of 84.31%/91.88% and 88.62%/92.64%, respectively. The source code is available at \\n<uri>https://github.com/masurq/CFFormer</uri>\\n.\",\"PeriodicalId\":13213,\"journal\":{\"name\":\"IEEE Transactions on Geoscience and Remote Sensing\",\"volume\":\"63 \",\"pages\":\"1-17\"},\"PeriodicalIF\":8.6000,\"publicationDate\":\"2024-12-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Geoscience and Remote Sensing\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10786275/\",\"RegionNum\":1,\"RegionCategory\":\"地球科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Geoscience and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10786275/","RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
CFFormer: A Cross-Fusion Transformer Framework for the Semantic Segmentation of Multisource Remote Sensing Images
Multisource remote sensing images (RSIs) capture complementary information about ground objects that can be exploited for semantic segmentation. However, multimodal data acquired by different sensors can exhibit inconsistencies and interference noise. It is therefore a challenge to effectively reduce the differences and noise between the modalities while fully utilizing their complementary features. In this article, we propose a universal cross-fusion transformer framework (CFFormer) for the semantic segmentation of multisource RSIs, adopting a parallel dual-stream structure to extract features separately from the different modalities. We introduce a feature correction module (FCM) that corrects the features of the current modality by combining features from the other modalities in both the spatial and channel dimensions. In the feature fusion module (FFM), we employ a multihead cross-attention mechanism to interact globally and fuse features from the different modalities, enabling comprehensive utilization of the complementary information in multisource RSIs. Finally, comparative experiments demonstrate that the proposed CFFormer framework not only achieves state-of-the-art (SOTA) accuracy but also exhibits outstanding robustness compared with current advanced networks for the semantic segmentation of multisource RSIs. Specifically, CFFormer achieves a mean intersection over union (mIoU) of 58% and an overall accuracy (OA) of 85.35% on the WHU-OPT-SAR dataset, outperforming the second-ranked network by 4.71% and 1.74%, respectively. On the Vaihingen and Potsdam datasets, CFFormer also achieves the best results, with mIoU and OA values of 84.31%/91.88% and 88.62%/92.64%, respectively. The source code is available at https://github.com/masurq/CFFormer.
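The abstract describes the two key modules only at a high level. The following PyTorch sketch illustrates one plausible reading of them: an FCM that gates one modality's features with channel and spatial attention derived from the other modality, and an FFM that fuses the two streams with multihead cross-attention. All layer choices and hyperparameters here are assumptions for illustration, not the authors' implementation; the actual code is in the linked repository.

```python
# Illustrative PyTorch sketch of the FCM and FFM described in the abstract.
# NOT the authors' implementation; module names follow the paper, but every
# layer choice below is an assumption. See the linked GitHub repo for the
# real code.
import torch
import torch.nn as nn


class FCM(nn.Module):
    """Feature correction: rectify the current modality's features using the
    other modality's, in both the channel and spatial dimensions (assumed
    squeeze-excite / spatial-gate style)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel-wise gate computed from the other modality (assumption).
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial gate computed from the other modality (assumption).
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x_cur: torch.Tensor, x_other: torch.Tensor) -> torch.Tensor:
        # Gate the current modality with attention from the other one,
        # keeping a residual path so the original signal is preserved.
        x = x_cur * self.channel_gate(x_other)
        x = x * self.spatial_gate(x_other)
        return x_cur + x


class FFM(nn.Module):
    """Feature fusion via multihead cross-attention: tokens of one modality
    attend globally to tokens of the other."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x_a.shape
        # Flatten the spatial maps into token sequences: (B, H*W, C).
        q = x_a.flatten(2).transpose(1, 2)
        kv = x_b.flatten(2).transpose(1, 2)
        # Queries come from modality A; keys/values from modality B.
        fused, _ = self.cross_attn(self.norm(q), self.norm(kv), self.norm(kv))
        fused = fused + q  # residual connection
        return fused.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    # Smoke test with two 64-channel feature maps (e.g., optical and SAR).
    opt = torch.randn(2, 64, 32, 32)
    sar = torch.randn(2, 64, 32, 32)
    corrected_opt = FCM(64)(opt, sar)
    fused = FFM(64)(corrected_opt, sar)
    print(fused.shape)  # torch.Size([2, 64, 32, 32])
```

In a full dual-stream network, an FCM/FFM pair of this kind would typically be applied at each encoder stage before decoding, but the staging details depend on the actual implementation.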
Journal Introduction:
IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.