{"title":"Explicit Semantic Alignment Network for RGB-T salient object detection with Hierarchical Cross-Modal Fusion","authors":"Hongkuan Wang, Qingxi Yu, Zhenguang Di, Gang Yang","doi":"10.1016/j.imavis.2025.105730","DOIUrl":null,"url":null,"abstract":"<div><div>Existing RGB-T salient object detection methods primarily rely on the learning mechanism of neural networks to perform implicit cross-modal feature alignment, aiming to achieve complementary fusion of modal features. However, this implicit feature alignment method has two main limitations: first, it is prone to causing loss of the salient object’s structural information; second, it may lead to abnormal activation responses that are not related to the object. To address the above issues, we propose the innovative Explicit Semantic Alignment (ESA) framework and design the Explicit Semantic Alignment Network for RGB-T Salient Object Detection with Hierarchical Cross-Modal Fusion (ESANet). Specifically, we design a Saliency-Aware Refinement Module (SARM), which fuses high-level semantic features with mid-level spatial details through cross-aggregation and the dynamic integration module to achieve bidirectional interaction and adaptive fusion of cross-modal features. It also utilizes a cross-modal multi-head attention mechanism to generate fine-grained shared semantic information. Subsequently, the Cross-Modal Feature Alignment Module (CFAM) introduces a window-based attention propagation mechanism, which enforces consistency in scene understanding between RGB and thermal modalities by using shared semantics as an alignment constraint. Finally, the Semantic-Guided Edge Sharpening Module (SESM) combines shared semantics with a weight enhancement strategy to optimize the consistency of shallow cross-modal feature distributions. Experimental results demonstrate that ESANet significantly outperforms existing state-of-the-art RGB-T salient object detection methods on three public datasets, validating its excellent performance in salient object detection tasks. Our code will be released at <span><span>https://github.com/whklearn/ESANet.git</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105730"},"PeriodicalIF":4.2000,"publicationDate":"2025-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S026288562500318X","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
Existing RGB-T salient object detection methods rely primarily on the learning mechanism of neural networks to perform implicit cross-modal feature alignment, aiming to achieve complementary fusion of modal features. However, implicit alignment has two main limitations: it is prone to losing the salient object's structural information, and it can produce abnormal activation responses unrelated to the object. To address these issues, we propose an Explicit Semantic Alignment (ESA) framework and design the Explicit Semantic Alignment Network for RGB-T Salient Object Detection with Hierarchical Cross-Modal Fusion (ESANet). Specifically, we design a Saliency-Aware Refinement Module (SARM) that fuses high-level semantic features with mid-level spatial details through cross-aggregation and a dynamic integration module, achieving bidirectional interaction and adaptive fusion of cross-modal features; it also uses a cross-modal multi-head attention mechanism to generate fine-grained shared semantic information. The Cross-Modal Feature Alignment Module (CFAM) then introduces a window-based attention propagation mechanism that enforces consistent scene understanding between the RGB and thermal modalities by using the shared semantics as an alignment constraint. Finally, the Semantic-Guided Edge Sharpening Module (SESM) combines the shared semantics with a weight enhancement strategy to improve the consistency of shallow cross-modal feature distributions. Experiments on three public datasets show that ESANet significantly outperforms existing state-of-the-art RGB-T salient object detection methods. Our code will be released at https://github.com/whklearn/ESANet.git.
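To make the SARM idea concrete, below is a minimal PyTorch sketch of cross-modal multi-head attention deriving shared semantics from RGB and thermal token streams. The class name, layer choices, and fusion-by-concatenation step are assumptions made for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Hypothetical sketch of the cross-modal multi-head attention a
    SARM-style module could use to derive shared semantics. All names
    and shapes are assumptions, not the paper's implementation."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Queries come from one modality, keys/values from the other,
        # so each modality attends to the complementary stream.
        self.attn_rgb2t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_t2rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, f_rgb: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
        # f_rgb, f_t: (B, N, C) token sequences from flattened feature maps.
        rgb_enh, _ = self.attn_rgb2t(f_rgb, f_t, f_t)   # RGB queries thermal
        t_enh, _ = self.attn_t2rgb(f_t, f_rgb, f_rgb)   # thermal queries RGB
        # Shared semantics: fuse the two cross-attended streams.
        return self.fuse(torch.cat([rgb_enh, t_enh], dim=-1))
```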
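CFAM's use of shared semantics as an alignment constraint can be illustrated, in spirit, as a consistency objective that pulls each modality toward the shared representation. The function below is a hedged sketch under that assumption; the paper's window-based attention propagation is more elaborate, and the name and cosine formulation here are not drawn from the source.

```python
import torch
import torch.nn.functional as F

def semantic_alignment_loss(f_rgb: torch.Tensor,
                            f_t: torch.Tensor,
                            shared: torch.Tensor) -> torch.Tensor:
    """Illustrative alignment constraint (not the paper's exact loss):
    pull each modality's features toward the shared semantics so that
    scene understanding stays consistent across RGB and thermal."""
    # All inputs: (B, N, C). Cosine distance to the shared representation,
    # averaged over tokens and the batch.
    loss_rgb = 1.0 - F.cosine_similarity(f_rgb, shared, dim=-1).mean()
    loss_t = 1.0 - F.cosine_similarity(f_t, shared, dim=-1).mean()
    return loss_rgb + loss_t
```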
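SESM's semantic-guided sharpening could plausibly be sketched as a semantic gate re-weighting an edge response on shallow features, as below. The Laplacian kernel, the 1x1 gating convolution, and the residual formulation are assumptions for illustration rather than the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticEdgeSharpen(nn.Module):
    """Illustrative take on semantic-guided edge sharpening (SESM-like):
    a semantic gate re-weights a Laplacian edge response on shallow
    features. Names and the exact gating are assumptions."""

    def __init__(self, channels: int):
        super().__init__()
        # Fixed 3x3 Laplacian applied depthwise to extract edge responses.
        lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
        self.register_buffer("kernel", lap.expand(channels, 1, 3, 3).clone())
        self.gate = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, shallow: torch.Tensor, semantics: torch.Tensor) -> torch.Tensor:
        # shallow: (B, C, H, W) fused shallow features;
        # semantics: (B, C, H, W) shared semantics upsampled to match.
        edges = F.conv2d(shallow, self.kernel, padding=1, groups=shallow.size(1))
        weight = torch.sigmoid(self.gate(semantics))  # semantic weighting
        return shallow + weight * edges               # sharpen where semantics agree
```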
About the journal
Image and Vision Computing aims primarily to provide an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to deepen understanding in the discipline by encouraging quantitative comparison and performance evaluation of proposed methodology. Coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.