SoftFormer: SAR-optical fusion transformer for urban land use and land cover classification

Rui Liu, Jing Ling, Hongsheng Zhang
DOI: 10.1016/j.isprsjprs.2024.09.012
Published: 2024-09-20 (Journal Article)
Full text: https://www.sciencedirect.com/science/article/pii/S0924271624003502
Classification of urban land use and land cover is vital to many applications and has naturally become a popular topic in remote sensing. The limited information carried by unimodal data, compound land use types, and the poor signal-to-noise ratio caused by adverse weather conditions inevitably lead to relatively poor classification performance. Recently, multimodal data fusion with deep learning has attracted a great deal of attention in the remote sensing community. Existing research integrates multimodal data at a single level and has left largely unexplored the immense potential of popular transformer and CNN structures for effectively leveraging multimodal data, which can leave the information fusion inadequate. We introduce SoftFormer, a novel network that synergistically merges the strengths of CNNs and transformers while achieving multi-level fusion. To extract local features from images, we propose an innovative mechanism called Interior Self-Attention, which is seamlessly integrated into the backbone network. To fully exploit the global semantic information from both modalities, we introduce a joint key–value learning fusion approach at the feature level that integrates multimodal data within a unified semantic space. Decision-level and feature-level information are integrated simultaneously, resulting in a multi-level fusion transformer network.
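The abstract does not give equations for the joint key–value learning fusion, but one plausible reading is a cross-attention in which queries come from one modality while keys and values are drawn from the concatenated tokens of both modalities, so that attention operates over a shared key–value space. The sketch below is a minimal numpy illustration of that idea only; the function name, random projection weights, and dimensions are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_kv_attention(x_sar, x_opt, d_k=16, seed=0):
    """Hypothetical sketch of joint key-value fusion: optical tokens
    query a key-value space built from BOTH modalities' tokens."""
    rng = np.random.default_rng(seed)
    d = x_sar.shape[-1]
    # illustrative (untrained) projection matrices
    w_q = rng.standard_normal((d, d_k)) / np.sqrt(d)
    w_k = rng.standard_normal((d, d_k)) / np.sqrt(d)
    w_v = rng.standard_normal((d, d_k)) / np.sqrt(d)
    q = x_opt @ w_q                                  # queries from one modality
    joint = np.concatenate([x_sar, x_opt], axis=0)   # unified token set
    k, v = joint @ w_k, joint @ w_v                  # shared key-value space
    attn = softmax(q @ k.T / np.sqrt(d_k))           # (n_opt, n_sar + n_opt)
    return attn @ v                                  # fused features

x_sar = np.random.default_rng(1).standard_normal((8, 32))
x_opt = np.random.default_rng(2).standard_normal((8, 32))
fused = joint_kv_attention(x_sar, x_opt)
print(fused.shape)  # (8, 16)
```

Because every query attends over tokens from both modalities at once, each fused feature can mix SAR and optical evidence in a single semantic space, which is the property the abstract attributes to the feature-level fusion.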
Results on four remote sensing datasets show that SoftFormer achieves improvements of at least 1.32%, 0.7%, and 0.99% in overall accuracy, kappa index, and mIoU, respectively, compared with other state-of-the-art methods. Ablation studies show that multimodal fusion outperforms unimodal data on urban land cover and land use classification, with improvements in overall accuracy, kappa index, and mIoU of up to 5.71%, 10.32%, and 7.91%, respectively, and the proposed modules boost performance to some extent even under cloud cover. Code will be publicly available at https://github.com/rl1024/SoftFormer.