Liuxin Bao;Xiaofei Zhou;Bolun Zheng;Runmin Cong;Haibing Yin;Jiyong Zhang;Chenggang Yan
IFENet: Interaction, Fusion, and Enhancement Network for V-D-T Salient Object Detection

DOI: 10.1109/TIP.2025.3527372
Journal: IEEE Transactions on Image Processing, vol. 34, pp. 483-494
Published: 2025-01-14 (Journal Article)
IEEE Xplore: https://ieeexplore.ieee.org/document/10841969/

Abstract: Visible-depth-thermal (VDT) salient object detection (SOD) aims to highlight the most visually attractive object by exploiting triple-modal cues. However, existing models do not sufficiently explore multi-modal correlations and differences, which leads to unsatisfactory detection performance. In this paper, we propose an interaction, fusion, and enhancement network (IFENet) for the VDT SOD task, comprising three key steps: multi-modal interaction, multi-modal fusion, and spatial enhancement. Specifically, built on a Transformer backbone, our IFENet acquires multi-scale multi-modal features. First, the inter-modal and intra-modal graph-based interaction (IIGI) module is deployed to explore inter-modal channel correlation and intra-modal long-range spatial dependency. Second, the gated attention-based fusion (GAF) module is employed to purify and aggregate the triple-modal features, where multi-modal features are filtered along the spatial, channel, and modality dimensions, respectively. Lastly, the frequency split-based enhancement (FSE) module separates the fused feature into high-frequency and low-frequency components to enhance spatial information (i.e., boundary details and object location) of the salient object. Extensive experiments on the VDT-2048 dataset show that our saliency model consistently outperforms 13 state-of-the-art models. Our code and results are available at https://github.com/Lx-Bao/IFENet.
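To make the frequency split idea concrete: a minimal sketch of separating a feature map into low- and high-frequency components, in the spirit of the FSE module. The paper's exact mechanism is not specified here, so this assumes a simple box (mean) filter as the low-pass branch, with the residual carrying the high frequencies; the function name `frequency_split` and the kernel size are illustrative, not from the paper.

```python
import numpy as np

def frequency_split(feat, kernel=3):
    """Split a 2-D feature map into low- and high-frequency parts.

    A box (mean) filter acts as a crude low-pass; the residual
    carries the high-frequency content, so low + high == feat.
    """
    h, w = feat.shape
    pad = kernel // 2
    padded = np.pad(feat, pad, mode="edge")
    low = np.empty_like(feat)
    for i in range(h):
        for j in range(w):
            # Local mean over the kernel window -> low-frequency value.
            low[i, j] = padded[i:i + kernel, j:j + kernel].mean()
    high = feat - low  # residual holds edges / boundary details
    return low, high

# Usage: the high-frequency part highlights boundaries, the
# low-frequency part retains coarse object location.
feat = np.random.rand(8, 8).astype(np.float32)
low, high = frequency_split(feat)
```

The decomposition is lossless by construction (`low + high` reconstructs the input), which is why such splits can be enhanced separately and recombined without discarding spatial information.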
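Similarly, the modality-dimension filtering in a gated fusion step can be sketched as weighting the three modality features (visible, depth, thermal) by gates that sum to one. This is a hedged illustration only: the actual GAF module uses learned attention, whereas here the gates are derived from a parameter-free global-average score followed by a softmax, and all names are hypothetical.

```python
import numpy as np

def gated_modality_fusion(feats):
    """Fuse stacked modality features with modality-wise gates.

    feats: array of shape (M, C, H, W), one slice per modality
    (e.g. visible / depth / thermal). Each modality is scored by its
    global average response; a softmax over the modality axis turns
    the scores into gates that sum to 1.
    """
    scores = feats.mean(axis=(1, 2, 3))            # (M,) per-modality score
    exp = np.exp(scores - scores.max())            # stable softmax
    gates = exp / exp.sum()                        # (M,), sums to 1
    # Weighted sum over the modality axis -> fused (C, H, W) feature.
    fused = np.tensordot(gates, feats, axes=1)
    return fused, gates

# Usage: three modalities, 4 channels, 8x8 spatial resolution.
rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 4, 8, 8))
fused, gates = gated_modality_fusion(feats)
```

Because the gates are normalized across modalities, an unreliable modality (e.g. a noisy depth map) can be down-weighted at fusion time rather than contaminating the fused feature uniformly.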