Junzhe Lu, Tingyu Wang, Bin Wan, Qiang Zhao, Shuai Wang, Yaoqi Sun, Yang Zhou, Chenggang Yan
{"title":"用于多模态显著目标检测的轻量级三流编码器-解码器网络","authors":"Junzhe Lu , Tingyu Wang , Bin Wan , Qiang Zhao , Shuai Wang , Yaoqi Sun , Yang Zhou , Chenggang Yan","doi":"10.1016/j.jvcir.2025.104523","DOIUrl":null,"url":null,"abstract":"<div><div>Salient object detection technique can identify the most attractive objects in a scene. In recent years, multi-modal salient object detection (SOD) has shown promising prospects. However, most of the existing multi-modal SOD models ignore modal size and computational cost in pursuit of comprehensive cross-modality feature fusion. To enhance the feasibility of high accuracy model in practical applications, we propose a Lightweight Three-stream Encoder–Decoder Network (TENet) for multi-modal salient object detection. Specifically, we design three decoders to explore saliency clues embedded in different multi-modal features and leverage a hierarchical decoding structure to alleviate the negative effects of low-quality images. To reduce the difference among modalities, we propose a lightweight modal information-guided fusion (MIGF) module to enhance the correlation between RGB-D and RGB-T modalities, thus laying the groundwork for triple-modal fusion. Furthermore, to utilize multi-scale information, we propose the semantic interaction (SI) module and the semantic feature enhancement (SFE) module to integrate specific hierarchical information embedded in high- and low-level features. 
Extensive experiments on the VDT-2048 dataset show that TENet has a model size of 37 MB, an inference speed of 38FPS, and achieves comparable accuracy to 16 state-of-the-art multi-modal methods.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"111 ","pages":"Article 104523"},"PeriodicalIF":3.1000,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Lightweight three-stream encoder–decoder network for multi-modal salient object detection\",\"authors\":\"Junzhe Lu , Tingyu Wang , Bin Wan , Qiang Zhao , Shuai Wang , Yaoqi Sun , Yang Zhou , Chenggang Yan\",\"doi\":\"10.1016/j.jvcir.2025.104523\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Salient object detection technique can identify the most attractive objects in a scene. In recent years, multi-modal salient object detection (SOD) has shown promising prospects. However, most of the existing multi-modal SOD models ignore modal size and computational cost in pursuit of comprehensive cross-modality feature fusion. To enhance the feasibility of high accuracy model in practical applications, we propose a Lightweight Three-stream Encoder–Decoder Network (TENet) for multi-modal salient object detection. Specifically, we design three decoders to explore saliency clues embedded in different multi-modal features and leverage a hierarchical decoding structure to alleviate the negative effects of low-quality images. To reduce the difference among modalities, we propose a lightweight modal information-guided fusion (MIGF) module to enhance the correlation between RGB-D and RGB-T modalities, thus laying the groundwork for triple-modal fusion. 
Furthermore, to utilize multi-scale information, we propose the semantic interaction (SI) module and the semantic feature enhancement (SFE) module to integrate specific hierarchical information embedded in high- and low-level features. Extensive experiments on the VDT-2048 dataset show that TENet has a model size of 37 MB, an inference speed of 38FPS, and achieves comparable accuracy to 16 state-of-the-art multi-modal methods.</div></div>\",\"PeriodicalId\":54755,\"journal\":{\"name\":\"Journal of Visual Communication and Image Representation\",\"volume\":\"111 \",\"pages\":\"Article 104523\"},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2025-07-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Visual Communication and Image Representation\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1047320325001373\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Visual Communication and Image Representation","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1047320325001373","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Lightweight three-stream encoder–decoder network for multi-modal salient object detection
Salient object detection (SOD) identifies the most visually attractive objects in a scene. In recent years, multi-modal SOD has shown promising prospects. However, most existing multi-modal SOD models ignore model size and computational cost in pursuit of comprehensive cross-modality feature fusion. To make high-accuracy models more feasible in practical applications, we propose a Lightweight Three-stream Encoder–Decoder Network (TENet) for multi-modal salient object detection. Specifically, we design three decoders to explore the saliency clues embedded in different multi-modal features and leverage a hierarchical decoding structure to alleviate the negative effects of low-quality images. To reduce the differences among modalities, we propose a lightweight modal information-guided fusion (MIGF) module that enhances the correlation between the RGB-D and RGB-T modalities, laying the groundwork for triple-modal fusion. Furthermore, to exploit multi-scale information, we propose the semantic interaction (SI) module and the semantic feature enhancement (SFE) module, which integrate the hierarchical information embedded in high- and low-level features. Extensive experiments on the VDT-2048 dataset show that TENet has a model size of 37 MB, runs at 38 FPS, and achieves accuracy comparable to 16 state-of-the-art multi-modal methods.
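The abstract's three-stream design can be illustrated with a minimal sketch. Since the paper's exact MIGF formulation is not given here, the code below is a hypothetical simplification: the RGB feature acts as a sigmoid gate that re-weights each auxiliary modality (depth, thermal), the two guided streams (RGB-D and RGB-T) are decoded in parallel, and their outputs are merged per hierarchy level for triple-modal fusion. All function names and the gating scheme are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def migf_fuse(rgb: np.ndarray, aux: np.ndarray) -> np.ndarray:
    """Hypothetical modal information-guided fusion: derive a sigmoid
    gate from the RGB feature and use it to inject the auxiliary
    (depth or thermal) feature as a guided residual."""
    gate = 1.0 / (1.0 + np.exp(-rgb))   # sigmoid gate from RGB stream
    return rgb + gate * aux             # guided residual fusion

def three_stream_decode(rgb_feats, depth_feats, thermal_feats):
    """Hypothetical hierarchical decoding: at each feature level, fuse
    the RGB-D and RGB-T streams separately, then average the two fused
    streams to obtain the triple-modal representation for that level."""
    fused_levels = []
    for rgb, depth, thermal in zip(rgb_feats, depth_feats, thermal_feats):
        rgbd = migf_fuse(rgb, depth)     # RGB-D stream
        rgbt = migf_fuse(rgb, thermal)   # RGB-T stream
        fused_levels.append(0.5 * (rgbd + rgbt))  # triple-modal merge
    return fused_levels
```

In this toy form the three encoder outputs are plain arrays per hierarchy level; in the actual network each level would be a convolutional feature map and the gate would be learned rather than a fixed sigmoid of the RGB activations.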
Journal introduction:
The Journal of Visual Communication and Image Representation publishes papers on state-of-the-art visual communication and image representation, with emphasis on novel technologies and theoretical work in this multidisciplinary area of pure and applied research. The field of visual communication and image representation is considered in its broadest sense and covers both digital and analog aspects as well as processing and communication in biological visual systems.