EM-Trans: Edge-Aware Multimodal Transformer for RGB-D Salient Object Detection

IF 8.9 · Zone 1 (Computer Science) · Q1 Computer Science, Artificial Intelligence
Geng Chen;Qingyue Wang;Bo Dong;Ruitao Ma;Nian Liu;Huazhu Fu;Yong Xia
{"title":"EM-Trans: Edge-Aware Multimodal Transformer for RGB-D Salient Object Detection","authors":"Geng Chen;Qingyue Wang;Bo Dong;Ruitao Ma;Nian Liu;Huazhu Fu;Yong Xia","doi":"10.1109/TNNLS.2024.3358858","DOIUrl":null,"url":null,"abstract":"RGB-D salient object detection (SOD) has gained tremendous attention in recent years. In particular, transformer has been employed and shown great potential. However, existing transformer models usually overlook the vital edge information, which is a major issue restricting the further improvement of SOD accuracy. To this end, we propose a novel edge-aware RGB-D SOD transformer, called EM-Trans, which explicitly models the edge information in a dual-band decomposition framework. Specifically, we employ two parallel decoder networks to learn the high-frequency edge and low-frequency body features from the low- and high-level features extracted from a two-steam multimodal backbone network, respectively. Next, we propose a cross-attention complementarity exploration module to enrich the edge/body features by exploiting the multimodal complementarity information. The refined features are then fed into our proposed color-hint guided fusion module for enhancing the depth feature and fusing the multimodal features. Finally, the resulting features are fused using our deeply supervised progressive fusion module, which progressively integrates edge and body features for predicting saliency maps. Our model explicitly considers the edge information for accurate RGB-D SOD, overcoming the limitations of existing methods and effectively improving the performance. Extensive experiments on benchmark datasets demonstrate that EM-Trans is an effective RGB-D SOD framework that outperforms the current state-of-the-art models, both quantitatively and qualitatively. A further extension to RGB-T SOD demonstrates the promising potential of our model in various kinds of multimodal SOD tasks.","PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"36 2","pages":"3175-3188"},"PeriodicalIF":8.9000,"publicationDate":"2024-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on neural networks and learning systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10433541/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

RGB-D salient object detection (SOD) has gained tremendous attention in recent years. In particular, transformers have been employed and have shown great potential. However, existing transformer models usually overlook vital edge information, which is a major issue restricting further improvement of SOD accuracy. To this end, we propose a novel edge-aware RGB-D SOD transformer, called EM-Trans, which explicitly models edge information in a dual-band decomposition framework. Specifically, we employ two parallel decoder networks to learn the high-frequency edge and low-frequency body features from the low- and high-level features extracted from a two-stream multimodal backbone network, respectively. Next, we propose a cross-attention complementarity exploration module to enrich the edge/body features by exploiting multimodal complementarity information. The refined features are then fed into our proposed color-hint guided fusion module, which enhances the depth features and fuses the multimodal features. Finally, the resulting features are fused using our deeply supervised progressive fusion module, which progressively integrates edge and body features to predict saliency maps. Our model explicitly considers edge information for accurate RGB-D SOD, overcoming the limitations of existing methods and effectively improving performance. Extensive experiments on benchmark datasets demonstrate that EM-Trans is an effective RGB-D SOD framework that outperforms current state-of-the-art models, both quantitatively and qualitatively. A further extension to RGB-T SOD demonstrates the promising potential of our model in various kinds of multimodal SOD tasks.
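The page carries no code, but the dual-band decomposition the abstract describes is concrete enough to sketch. Below is a minimal PyTorch illustration of one common way to split a ground-truth saliency mask into a high-frequency edge band and a low-frequency body band; the function name, the morphological-gradient formulation, and the kernel size are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a dual-band (edge/body) decomposition of a
# saliency mask. Illustrates the high-/low-frequency split the abstract
# describes; NOT the paper's actual decomposition.
import torch
import torch.nn.functional as F

def decompose_mask(mask: torch.Tensor, kernel: int = 5):
    """Split a float binary mask (B, 1, H, W) with values in {0, 1}
    into a thin edge band and the remaining body region."""
    pad = kernel // 2
    # Morphological dilation and erosion via max pooling.
    dilated = F.max_pool2d(mask, kernel, stride=1, padding=pad)
    eroded = -F.max_pool2d(-mask, kernel, stride=1, padding=pad)
    edge = dilated - eroded        # thin band around the object boundary
    body = mask * (1.0 - edge)     # object interior with the edge band removed
    return edge, body
```

With such a split, each of the two parallel decoders can be supervised on its own band: one against the edge map, one against the body map.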
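Likewise, here is a minimal sketch of cross-attention between RGB and depth feature tokens, in the spirit of the cross-attention complementarity exploration module described above. The class name, token shapes, and residual/norm arrangement are all hypothetical, chosen only to make the mechanism concrete.

```python
# Hypothetical sketch of cross-modal attention: queries from one
# modality attend to keys/values from the other, so each branch can
# borrow complementary information. NOT the authors' implementation.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        # query_feats, context_feats: (batch, tokens, dim)
        attended, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + attended)  # residual + layer norm

# Usage: enrich RGB tokens with depth context, and vice versa.
rgb = torch.randn(2, 196, 256)    # e.g. 14x14 patch tokens, dim 256
depth = torch.randn(2, 196, 256)
rgb_enriched = CrossModalAttention(dim=256)(rgb, depth)
depth_enriched = CrossModalAttention(dim=256)(depth, rgb)
```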
Source Journal

IEEE Transactions on Neural Networks and Learning Systems
Categories: Computer Science, Artificial Intelligence; Computer Science, Hardware & Architecture
CiteScore: 23.80
Self-citation rate: 9.60%
Publication volume: 2102
Review time: 3-8 weeks

Journal introduction: The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.