Disentangled Cross-Modal Transformer for RGB-D Salient Object Detection and Beyond
Hao Chen; Feihong Shen; Ding Ding; Yongjian Deng; Chao Li
IEEE Transactions on Image Processing (IEEE Signal Processing Society)
Published: 2024-02-14 (Journal Article)
DOI: 10.1109/TIP.2024.3364022
URL: https://ieeexplore.ieee.org/document/10436554/
Citations: 0
Abstract
Previous multi-modal transformers for RGB-D salient object detection (SOD) generally connect all patches from the two modalities directly to model cross-modal correlation, and combine the modalities without differentiation, which can lead to confusing and inefficient fusion. Instead, we disentangle the cross-modal complementarity from two views to reduce cross-modal fusion ambiguity: 1) Context disentanglement. We argue that modeling long-range dependencies across modalities, as done before, is uninformative due to the severe modality gap. Instead, we propose to disentangle the cross-modal complementary contexts into intra-modal self-attention, which explores global complementary understanding, and spatially-aligned inter-modal attention, which captures local cross-modal correlations. 2) Representation disentanglement. Unlike previous undifferentiated combinations of cross-modal representations, we find that cross-modal cues complement each other both by enhancing common discriminative regions and by mutually supplementing modality-specific highlights. On top of this, we divide the tokens into consistent and private ones along the channel dimension to disentangle the multi-modal integration path and explicitly boost these two complementary ways. By progressively propagating this strategy across layers, the proposed Disentangled Feature Pyramid module (DFP) enables informative cross-modal, cross-level integration and better fusion adaptivity. Comprehensive experiments on a large variety of public datasets verify the efficacy of our context and representation disentanglement and show consistent improvement over state-of-the-art models. Additionally, our cross-modal attention hierarchy is plug-and-play for different backbone architectures (both transformer and CNN) and downstream tasks, and experiments on a CNN-based model and on RGB-D semantic segmentation verify this generalization ability.
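The two disentanglement ideas in the abstract can be illustrated in miniature. Below is a hedged numpy sketch (not the authors' implementation, and far simpler than a real transformer layer): global context is modeled by self-attention *within* one modality; cross-modal interaction is restricted to spatially aligned token pairs; and fusion splits channels into "consistent" tokens (merged by mutual enhancement) and "private" tokens (merged by mutual supplementation). The specific fusion operators (product for enhancement, sum for supplementation) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def intra_modal_self_attention(tokens):
    # Global long-range context modeled WITHIN one modality only:
    # tokens is (N, C); every patch attends to all patches of its own modality.
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    return softmax(scores) @ tokens

def spatially_aligned_cross_attention(rgb, depth):
    # Local cross-modal correlation: each RGB token interacts only with the
    # depth token at the SAME spatial position (a per-position gate), rather
    # than attending to every depth patch. Gate form is an assumption.
    sim = np.sum(rgb * depth, axis=-1, keepdims=True) / np.sqrt(rgb.shape[-1])
    gate = 1.0 / (1.0 + np.exp(-sim))          # sigmoid of aligned similarity
    return rgb + gate * depth

def disentangled_fusion(rgb, depth, split):
    # Representation disentanglement along the channel dimension:
    # the first `split` channels act as "consistent" tokens, fused by
    # enhancing shared discriminative responses (illustrated as a product);
    # the remaining "private" channels supplement modality-specific cues
    # (illustrated as a sum).
    consistent = rgb[:, :split] * depth[:, :split]
    private = rgb[:, split:] + depth[:, split:]
    return np.concatenate([consistent, private], axis=-1)

rng = np.random.default_rng(0)
rgb = rng.standard_normal((16, 8))    # 16 patch tokens, 8 channels
depth = rng.standard_normal((16, 8))

rgb_ctx = intra_modal_self_attention(rgb)                 # global, intra-modal
rgb_loc = spatially_aligned_cross_attention(rgb_ctx, depth)  # local, aligned
fused = disentangled_fusion(rgb_loc, depth, split=4)
print(fused.shape)  # (16, 8): token count and channels preserved
```

In a full model this pattern would repeat per pyramid level, with the consistent/private split propagated progressively across layers as the DFP module does.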