Authors: Zhangping Tu; Xiaohong Qian; Wujie Zhou
Journal: IEEE Transactions on Automation Science and Engineering, vol. 22, pp. 12911-12921
Publication date: 2025-02-19
Publication type: Journal Article
DOI: 10.1109/TASE.2025.3543586
URL: https://ieeexplore.ieee.org/document/10892262/
Impact factor: 6.4 (JCR Q1, Automation & Control Systems)
Source platform: Semanticscholar
Citations: 0
Efficient RGB-D Co-Salient Object Detection via Modality-Aware Prompting
RGB-D co-salient object detection (Co-SOD) aims to identify and segment the salient objects that co-occur across a set of correlated images and depth maps. Most existing RGB-D Co-SOD methods fully fine-tune a dual-stream encoder-decoder architecture and fuse the RGB and depth features with a complex feature fusion strategy. This is expensive to train because a large number of parameters must be updated during feature extraction and fusion. In addition, current methods pay insufficient attention to separating co-salient information from non-co-salient information, and the resulting interference degrades the localization of co-salient targets. This study therefore proposes a simple and effective modality-aware prompting network (MAPNet) for efficient RGB-D Co-SOD. MAPNet relies on two components: a multimodal prompt generator (MPG) for modal fusion and a consensus feature extraction module (CFEM) for consensus feature extraction. Specifically, the MPG uses the RGB features obtained from the frozen backbone network to guide the depth features in the fine-tuned backbone network, fuses them in hyperbolic space to generate multilevel modal cues, and injects these cues into the fine-tuned backbone network for efficient modal fusion. The CFEM uses RGB features to generate an image saliency prior, combines this prior with the highest-level fused features to obtain a central point, and takes the salient features that lie closest to the central point as the consensus features of the image group. A contrast loss is also introduced to separate the synergistic (co-salient) salient features from the non-synergistic ones, so that pure co-salient features are obtained. The trained MAPNet delivers state-of-the-art performance on three benchmark datasets (RGB-D CoSal1k, RGB-D CoSal150, and RGB-D CoSeg183), improving the structure-measure by 2.1% on the RGB-D CoSeg183 dataset. The code is available at https://github.com/trumpetor/MAPNet.
Note to Practitioners: This study presents a straightforward and effective modality-aware prompting network (MAPNet) designed for efficient RGB-D Co-SOD. First, the MPG module of MAPNet integrates the RGB and depth modalities through prompt learning. Next, the CFEM employs a pixel-group centroid proxy and a top-k selection mechanism to extract consensus features from the high-level integrated features and the saliency prior; these serve as the coordinated saliency features of the image group. Finally, the coordinated saliency features and the integrated features are fed into the decoder to generate predictions.
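The centroid-proxy and top-k selection attributed to the CFEM can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how the saliency prior and the features are combined or which distance is used, so the saliency-weighted mean and the cosine similarity below are assumptions, and the names `cosine` and `consensus_features` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon guards against zero vectors


def consensus_features(features, saliency, k):
    """Pick the k features closest to a saliency-weighted centroid.

    features: N feature vectors (e.g. pixels pooled over an image group)
    saliency: N prior weights in [0, 1], one per feature vector
    Returns (centroid, indices of the k selected features, best first).
    """
    total = sum(saliency)
    if total <= 0:
        raise ValueError("saliency prior must contain positive weight")
    dim = len(features[0])
    # Assumed combination rule: the centroid is the saliency-weighted mean.
    centroid = [
        sum(w * f[d] for w, f in zip(saliency, features)) / total
        for d in range(dim)
    ]
    # Assumed closeness measure: cosine similarity to the centroid.
    scores = [cosine(f, centroid) for f in features]
    ranked = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
    return centroid, ranked[:k]


# Toy group: three features point roughly the same way, one is an outlier
# with a low saliency weight.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
prior = [1.0, 0.8, 0.1, 0.9]
_, selected = consensus_features(feats, prior, k=2)
print(selected)  # [1, 3]
```

In this toy group the low-saliency outlier (index 2) barely moves the centroid and ranks last, which is the behavior the abstract describes: features far from the central point are excluded from the consensus.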
About the journal:
The IEEE Transactions on Automation Science and Engineering (T-ASE) publishes fundamental papers on Automation, emphasizing scientific results that advance efficiency, quality, productivity, and reliability. T-ASE encourages interdisciplinary approaches from computer science, control systems, electrical engineering, mathematics, mechanical engineering, operations research, and other fields. T-ASE welcomes results relevant to industries such as agriculture, biotechnology, healthcare, home automation, maintenance, manufacturing, pharmaceuticals, retail, security, service, supply chains, and transportation. T-ASE addresses a research community willing to integrate knowledge across disciplines and industries. For this purpose, each paper includes a Note to Practitioners that summarizes how its results can be applied or how they might be extended to apply in practice.