{"title":"Explicit Semantic Alignment Network for RGB-T salient object detection with Hierarchical Cross-Modal Fusion","authors":"Hongkuan Wang, Qingxi Yu, Zhenguang Di, Gang Yang","doi":"10.1016/j.imavis.2025.105730","DOIUrl":null,"url":null,"abstract":"<div><div>Existing RGB-T salient object detection methods primarily rely on the learning mechanism of neural networks to perform implicit cross-modal feature alignment, aiming to achieve complementary fusion of modal features. However, this implicit feature alignment method has two main limitations: first, it is prone to causing loss of the salient object’s structural information; second, it may lead to abnormal activation responses that are not related to the object. To address the above issues, we propose the innovative Explicit Semantic Alignment (ESA) framework and design the Explicit Semantic Alignment Network for RGB-T Salient Object Detection with Hierarchical Cross-Modal Fusion (ESANet). Specifically, we design a Saliency-Aware Refinement Module (SARM), which fuses high-level semantic features with mid-level spatial details through cross-aggregation and the dynamic integration module to achieve bidirectional interaction and adaptive fusion of cross-modal features. It also utilizes a cross-modal multi-head attention mechanism to generate fine-grained shared semantic information. Subsequently, the Cross-Modal Feature Alignment Module (CFAM) introduces a window-based attention propagation mechanism, which enforces consistency in scene understanding between RGB and thermal modalities by using shared semantics as an alignment constraint. Finally, the Semantic-Guided Edge Sharpening Module (SESM) combines shared semantics with a weight enhancement strategy to optimize the consistency of shallow cross-modal feature distributions. Experimental results demonstrate that ESANet significantly outperforms existing state-of-the-art RGB-T salient object detection methods on three public datasets, validating its excellent performance in salient object detection tasks. Our code will be released at <span><span>https://github.com/whklearn/ESANet.git</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105730"},"PeriodicalIF":4.2000,"publicationDate":"2025-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S026288562500318X","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
Existing RGB-T salient object detection methods rely primarily on the learning mechanism of neural networks to perform implicit cross-modal feature alignment, aiming to achieve complementary fusion of modal features. However, implicit alignment has two main limitations: it is prone to losing the salient object's structural information, and it can produce abnormal activation responses unrelated to the object. To address these issues, we propose an Explicit Semantic Alignment (ESA) framework and design the Explicit Semantic Alignment Network for RGB-T Salient Object Detection with Hierarchical Cross-Modal Fusion (ESANet). Specifically, we design a Saliency-Aware Refinement Module (SARM) that fuses high-level semantic features with mid-level spatial details through cross-aggregation and a dynamic integration module, achieving bidirectional interaction and adaptive fusion of cross-modal features; it also uses a cross-modal multi-head attention mechanism to generate fine-grained shared semantic information. The Cross-Modal Feature Alignment Module (CFAM) then introduces a window-based attention propagation mechanism that enforces consistent scene understanding between the RGB and thermal modalities by using the shared semantics as an alignment constraint. Finally, the Semantic-Guided Edge Sharpening Module (SESM) combines the shared semantics with a weight enhancement strategy to improve the consistency of shallow cross-modal feature distributions. Experiments on three public datasets show that ESANet significantly outperforms existing state-of-the-art RGB-T salient object detection methods. Our code will be released at https://github.com/whklearn/ESANet.git.
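To make the SARM idea concrete, below is a minimal PyTorch sketch of cross-modal multi-head attention deriving shared semantics from RGB and thermal token streams. The class name, layer choices, and fusion-by-concatenation step are assumptions made for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Hypothetical sketch of the cross-modal multi-head attention a
    SARM-style module could use to derive shared semantics. All names
    and shapes are assumptions, not the paper's implementation."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Queries come from one modality, keys/values from the other,
        # so each modality attends to the complementary stream.
        self.attn_rgb2t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_t2rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, f_rgb: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
        # f_rgb, f_t: (B, N, C) token sequences from flattened feature maps.
        rgb_enh, _ = self.attn_rgb2t(f_rgb, f_t, f_t)   # RGB queries thermal
        t_enh, _ = self.attn_t2rgb(f_t, f_rgb, f_rgb)   # thermal queries RGB
        # Shared semantics: fuse the two cross-attended streams.
        return self.fuse(torch.cat([rgb_enh, t_enh], dim=-1))
```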
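CFAM's use of shared semantics as an alignment constraint can be illustrated, in spirit, as a consistency objective that pulls each modality toward the shared representation. The function below is a hedged sketch under that assumption; the paper's window-based attention propagation is more elaborate, and the name and cosine formulation here are not drawn from the source.

```python
import torch
import torch.nn.functional as F

def semantic_alignment_loss(f_rgb: torch.Tensor,
                            f_t: torch.Tensor,
                            shared: torch.Tensor) -> torch.Tensor:
    """Illustrative alignment constraint (not the paper's exact loss):
    pull each modality's features toward the shared semantics so that
    scene understanding stays consistent across RGB and thermal."""
    # All inputs: (B, N, C). Cosine distance to the shared representation,
    # averaged over tokens and the batch.
    loss_rgb = 1.0 - F.cosine_similarity(f_rgb, shared, dim=-1).mean()
    loss_t = 1.0 - F.cosine_similarity(f_t, shared, dim=-1).mean()
    return loss_rgb + loss_t
```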
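SESM's semantic-guided sharpening could plausibly be sketched as a semantic gate re-weighting an edge response on shallow features, as below. The Laplacian kernel, the 1x1 gating convolution, and the residual formulation are assumptions for illustration rather than the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticEdgeSharpen(nn.Module):
    """Illustrative take on semantic-guided edge sharpening (SESM-like):
    a semantic gate re-weights a Laplacian edge response on shallow
    features. Names and the exact gating are assumptions."""

    def __init__(self, channels: int):
        super().__init__()
        # Fixed 3x3 Laplacian applied depthwise to extract edge responses.
        lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
        self.register_buffer("kernel", lap.expand(channels, 1, 3, 3).clone())
        self.gate = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, shallow: torch.Tensor, semantics: torch.Tensor) -> torch.Tensor:
        # shallow: (B, C, H, W) fused shallow features;
        # semantics: (B, C, H, W) shared semantics upsampled to match.
        edges = F.conv2d(shallow, self.kernel, padding=1, groups=shallow.size(1))
        weight = torch.sigmoid(self.gate(semantics))  # semantic weighting
        return shallow + weight * edges               # sharpen where semantics agree
```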
About the journal
Image and Vision Computing aims primarily to provide an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to deepen understanding in the discipline by encouraging quantitative comparison and performance evaluation of proposed methodology. Coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.