{"title":"SoftFormer:用于城市土地利用和土地覆被分类的合成孔径雷达-光学融合变换器","authors":"Rui Liu , Jing Ling , Hongsheng Zhang","doi":"10.1016/j.isprsjprs.2024.09.012","DOIUrl":null,"url":null,"abstract":"<div><p>Classification of urban land use and land cover is vital to many applications, and naturally becomes a popular topic in remote sensing. The finite information carried by unimodal data, the compound land use types, and the poor signal-noise ratio caused by restricted weather conditions would inevitably lead to relatively poor classification performance. Recently in remote sensing society, multimodal data fusion with deep learning technology has gained a great deal of attention. Existing research exhibit integration of multimodal data at a single level, while simultaneously lacking exploration of the immense potential provided by popular transformer and CNN structures for effectively leveraging multimodal data, which may fall into the trap that makes the information fusion inadequate. We introduce SoftFormer, a novel network that synergistically merges the strengths of CNNs with transformers, as well as achieving multi-level fusion. To extract local features from images, we propose an innovative mechanism called Interior Self-Attention, which is seamlessly integrated into the backbone network. To fully exploit the global semantic information from both modalities, in the feature-level fusion, we introduce a joint key–value learning fusion approach to integrate multimodal data within a unified semantic space. The decision and feature level information are simultaneously integrated, resulting in a multi-level fusion transformer network. 
Results on four remote sensing datasets show that SoftFormer is able to achieve at least 1.32%, 0.7%, and 0.99% performance improvement in overall accuracy, kappa index, and mIoU, compared to other state-of-the-art methods, the ablation studies show that multimodal fusion outperforms the unimodal data on urban land cover and land use classification, the highest overall accuracy, kappa index as well as mIoU improvement can be up to 5.71%, 10.32% and 7.91%, and the proposed modules are able to boost performance to some extent, even with cloud cover. Code will be publicly available at <span><span>https://github.com/rl1024/SoftFormer</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":50269,"journal":{"name":"ISPRS Journal of Photogrammetry and Remote Sensing","volume":"218 ","pages":"Pages 277-293"},"PeriodicalIF":10.6000,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SoftFormer: SAR-optical fusion transformer for urban land use and land cover classification\",\"authors\":\"Rui Liu , Jing Ling , Hongsheng Zhang\",\"doi\":\"10.1016/j.isprsjprs.2024.09.012\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Classification of urban land use and land cover is vital to many applications, and naturally becomes a popular topic in remote sensing. The finite information carried by unimodal data, the compound land use types, and the poor signal-noise ratio caused by restricted weather conditions would inevitably lead to relatively poor classification performance. Recently in remote sensing society, multimodal data fusion with deep learning technology has gained a great deal of attention. 
Existing research exhibit integration of multimodal data at a single level, while simultaneously lacking exploration of the immense potential provided by popular transformer and CNN structures for effectively leveraging multimodal data, which may fall into the trap that makes the information fusion inadequate. We introduce SoftFormer, a novel network that synergistically merges the strengths of CNNs with transformers, as well as achieving multi-level fusion. To extract local features from images, we propose an innovative mechanism called Interior Self-Attention, which is seamlessly integrated into the backbone network. To fully exploit the global semantic information from both modalities, in the feature-level fusion, we introduce a joint key–value learning fusion approach to integrate multimodal data within a unified semantic space. The decision and feature level information are simultaneously integrated, resulting in a multi-level fusion transformer network. Results on four remote sensing datasets show that SoftFormer is able to achieve at least 1.32%, 0.7%, and 0.99% performance improvement in overall accuracy, kappa index, and mIoU, compared to other state-of-the-art methods, the ablation studies show that multimodal fusion outperforms the unimodal data on urban land cover and land use classification, the highest overall accuracy, kappa index as well as mIoU improvement can be up to 5.71%, 10.32% and 7.91%, and the proposed modules are able to boost performance to some extent, even with cloud cover. 
Code will be publicly available at <span><span>https://github.com/rl1024/SoftFormer</span><svg><path></path></svg></span>.</p></div>\",\"PeriodicalId\":50269,\"journal\":{\"name\":\"ISPRS Journal of Photogrammetry and Remote Sensing\",\"volume\":\"218 \",\"pages\":\"Pages 277-293\"},\"PeriodicalIF\":10.6000,\"publicationDate\":\"2024-09-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ISPRS Journal of Photogrammetry and Remote Sensing\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0924271624003502\",\"RegionNum\":1,\"RegionCategory\":\"地球科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"GEOGRAPHY, PHYSICAL\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ISPRS Journal of Photogrammetry and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0924271624003502","RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"GEOGRAPHY, PHYSICAL","Score":null,"Total":0}
SoftFormer: SAR-optical fusion transformer for urban land use and land cover classification
Classification of urban land use and land cover is vital to many applications and has naturally become a popular topic in remote sensing. The limited information carried by unimodal data, compound land use types, and the poor signal-to-noise ratio caused by adverse weather conditions inevitably lead to relatively poor classification performance. Recently, multimodal data fusion with deep learning has gained a great deal of attention in the remote sensing community. Existing research integrates multimodal data at only a single level and leaves largely unexplored the potential of popular transformer and CNN structures for effectively leveraging multimodal data, which can render the information fusion inadequate. We introduce SoftFormer, a novel network that synergistically merges the strengths of CNNs and transformers while achieving multi-level fusion. To extract local features from images, we propose Interior Self-Attention, an innovative mechanism seamlessly integrated into the backbone network. To fully exploit the global semantic information of both modalities, the feature-level fusion uses a joint key–value learning approach that integrates multimodal data within a unified semantic space. Decision-level and feature-level information are integrated simultaneously, yielding a multi-level fusion transformer network.
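The joint key–value fusion described above can be illustrated with a minimal sketch: queries come from one modality while keys and values are pooled from both branches, so attention operates over a shared token set. This is a hypothetical NumPy illustration under assumed shapes; the paper's actual projection weights, heads, and normalization are not specified in the abstract.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_kv_fusion(feat_sar, feat_opt):
    """Attention with keys/values drawn jointly from both modalities.

    feat_sar, feat_opt: (tokens, dim) feature maps from the SAR and
    optical branches. Hypothetical sketch of joint key-value fusion;
    learned projections are omitted for clarity.
    """
    d = feat_sar.shape[-1]
    q = feat_sar                                        # queries: one modality
    kv = np.concatenate([feat_sar, feat_opt], axis=0)   # joint keys/values
    attn = softmax(q @ kv.T / np.sqrt(d), axis=-1)      # (tokens, 2*tokens)
    return attn @ kv                                    # fused (tokens, dim)
```

Because both modalities contribute to the same key–value pool, each output token is a mixture over SAR and optical tokens within one semantic space.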
Results on four remote sensing datasets show that SoftFormer achieves improvements of at least 1.32%, 0.7%, and 0.99% in overall accuracy, kappa index, and mIoU over other state-of-the-art methods. Ablation studies show that multimodal fusion outperforms unimodal data on urban land cover and land use classification, with improvements of up to 5.71% in overall accuracy, 10.32% in kappa index, and 7.91% in mIoU, and the proposed modules boost performance to some extent even under cloud cover. Code will be publicly available at https://github.com/rl1024/SoftFormer.
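The Interior Self-Attention idea of extracting local features can be sketched as self-attention confined to non-overlapping local windows of tokens, so each token attends only to its neighborhood. This is a plain-NumPy sketch in the spirit of the mechanism; the paper's actual formulation (window shape, projections, integration with the CNN backbone) may differ.

```python
import numpy as np

def interior_self_attention(x, window=4):
    """Self-attention applied independently inside each local window.

    x: (n_tokens, dim) with n_tokens divisible by `window`.
    Tokens in different windows cannot influence each other, which
    keeps the attended features local.
    """
    n, d = x.shape
    out = np.empty_like(x)
    for start in range(0, n, window):
        blk = x[start:start + window]                   # one local window
        scores = blk @ blk.T / np.sqrt(d)               # within-window affinity
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn = e / e.sum(axis=-1, keepdims=True)        # row-stochastic weights
        out[start:start + window] = attn @ blk          # local aggregation
    return out
```

Restricting attention to window interiors gives the locality of a convolution while retaining content-dependent weighting, which is the kind of CNN–transformer synergy the abstract describes.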
Journal introduction:
The ISPRS Journal of Photogrammetry and Remote Sensing (P&RS) serves as the official journal of the International Society for Photogrammetry and Remote Sensing (ISPRS). It acts as a platform for scientists and professionals worldwide who are involved in various disciplines that utilize photogrammetry, remote sensing, spatial information systems, computer vision, and related fields. The journal aims to facilitate communication and dissemination of advancements in these disciplines, while also acting as a comprehensive source of reference and archive.
P&RS endeavors to publish high-quality, peer-reviewed research papers that are original and previously unpublished. These papers may cover scientific/research, technological-development, or practical-application aspects. The journal also welcomes papers based on presentations from ISPRS meetings, provided they constitute significant contributions to the aforementioned fields.
In particular, P&RS encourages the submission of papers that are of broad scientific interest, showcase innovative applications (especially in emerging fields), have an interdisciplinary focus, discuss topics that have received limited attention in P&RS or related journals, or explore new directions in scientific or professional realms. It is preferred that theoretical papers include practical applications, while papers focusing on systems and applications should include a theoretical background.