Bei Cheng;Zao Liu;Huxiao Tang;Qingwang Wang;Wenhao Chen;Tao Chen;Tao Shen
{"title":"遥感显著目标检测的多模态引导变压器结构","authors":"Bei Cheng;Zao Liu;Huxiao Tang;Qingwang Wang;Wenhao Chen;Tao Chen;Tao Shen","doi":"10.1109/LGRS.2025.3601083","DOIUrl":null,"url":null,"abstract":"The latest remote sensing image saliency detectors primarily rely on RGB information alone. However, spatial and geometric information embedded in depth images is robust to variations in lighting and color. Integrating depth information with RGB images can enhance the spatial structure of objects. In light of this, we innovatively propose a remote sensing image saliency detection model that fuses RGB and depth information, named the multimodal-guided transformer architecture (MGTA). Specifically, we first introduce the strongly correlated complementary fusion (SCCF) module to explore cross-modal consistency and similarity, maintaining consistency across different modalities while uncovering multidimensional common information. In addition, the global–local context information interaction (GLCII) module is designed to extract global semantic information and local detail information, effectively utilizing contextual information while reducing the number of parameters. Finally, a cascaded feature-guided decoder (CFGD) is employed to gradually fuse hierarchical decoding features, effectively integrating multilevel data and accurately locating target positions. Extensive experiments demonstrate that our proposed model outperforms 14 state-of-the-art methods. The code and results of our method are available at <uri>https://github.com/Zackisliuzao/MGTANet</uri>","PeriodicalId":91017,"journal":{"name":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","volume":"22 ","pages":"1-5"},"PeriodicalIF":4.4000,"publicationDate":"2025-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multimodal-Guided Transformer Architecture for Remote Sensing Salient Object Detection\",\"authors\":\"Bei Cheng;Zao Liu;Huxiao Tang;Qingwang Wang;Wenhao Chen;Tao Chen;Tao Shen\",\"doi\":\"10.1109/LGRS.2025.3601083\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The latest remote sensing image saliency detectors primarily rely on RGB information alone. However, spatial and geometric information embedded in depth images is robust to variations in lighting and color. Integrating depth information with RGB images can enhance the spatial structure of objects. In light of this, we innovatively propose a remote sensing image saliency detection model that fuses RGB and depth information, named the multimodal-guided transformer architecture (MGTA). Specifically, we first introduce the strongly correlated complementary fusion (SCCF) module to explore cross-modal consistency and similarity, maintaining consistency across different modalities while uncovering multidimensional common information. In addition, the global–local context information interaction (GLCII) module is designed to extract global semantic information and local detail information, effectively utilizing contextual information while reducing the number of parameters. Finally, a cascaded feature-guided decoder (CFGD) is employed to gradually fuse hierarchical decoding features, effectively integrating multilevel data and accurately locating target positions. Extensive experiments demonstrate that our proposed model outperforms 14 state-of-the-art methods. 
The code and results of our method are available at <uri>https://github.com/Zackisliuzao/MGTANet</uri>\",\"PeriodicalId\":91017,\"journal\":{\"name\":\"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society\",\"volume\":\"22 \",\"pages\":\"1-5\"},\"PeriodicalIF\":4.4000,\"publicationDate\":\"2025-08-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11133601/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11133601/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Multimodal-Guided Transformer Architecture for Remote Sensing Salient Object Detection
The latest remote sensing image saliency detectors primarily rely on RGB information alone. However, spatial and geometric information embedded in depth images is robust to variations in lighting and color. Integrating depth information with RGB images can enhance the spatial structure of objects. In light of this, we innovatively propose a remote sensing image saliency detection model that fuses RGB and depth information, named the multimodal-guided transformer architecture (MGTA). Specifically, we first introduce the strongly correlated complementary fusion (SCCF) module to explore cross-modal consistency and similarity, maintaining consistency across different modalities while uncovering multidimensional common information. In addition, the global-local context information interaction (GLCII) module is designed to extract global semantic information and local detail information, effectively utilizing contextual information while reducing the number of parameters. Finally, a cascaded feature-guided decoder (CFGD) is employed to gradually fuse hierarchical decoding features, effectively integrating multilevel data and accurately locating target positions. Extensive experiments demonstrate that our proposed model outperforms 14 state-of-the-art methods. The code and results of our method are available at https://github.com/Zackisliuzao/MGTANet.
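The abstract describes three components: a cross-modal fusion module (SCCF), a global-local context module (GLCII), and a cascaded decoder (CFGD). As an illustration only, the following is a minimal PyTorch sketch of an attention-weighted RGB-depth fusion block in the spirit of the SCCF idea; the class name, layer choices, and tensor shapes are assumptions made here, not the authors' released implementation (see the GitHub repository above for the official code).

# Hypothetical sketch of RGB-depth cross-modal fusion; names and layers are
# illustrative assumptions, not taken from the authors' MGTANet code.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuse RGB and depth feature maps of identical shape (B, C, H, W)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel attention computed from the concatenated modalities,
        # used to re-weight each modality before fusion.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 2 * channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # 1x1 convolution projecting the fused features back to `channels`.
        self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb, depth], dim=1)   # (B, 2C, H, W)
        w = self.attn(x)                      # per-channel weights in (0, 1)
        x = x * w                             # modality-aware re-weighting
        shared = rgb * depth                  # multiplicative "common" cue
        return self.project(x) + shared       # complementary + common information


if __name__ == "__main__":
    rgb_feat = torch.randn(2, 64, 32, 32)
    depth_feat = torch.randn(2, 64, 32, 32)
    out = CrossModalFusion(64)(rgb_feat, depth_feat)
    print(out.shape)  # torch.Size([2, 64, 32, 32])

In this sketch, the channel-attention branch re-weights the concatenated modalities, while the multiplicative term retains cues on which both modalities agree, loosely mirroring the "multidimensional common information" described in the abstract.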