{"title":"ADMNet: Attention-Guided Densely Multi-Scale Network for Lightweight Salient Object Detection","authors":"Xiaofei Zhou;Kunye Shen;Zhi Liu","doi":"10.1109/TMM.2024.3413529","DOIUrl":null,"url":null,"abstract":"Recently, benefitting from the rapid development of deep learning technology, the research of salient object detection has achieved great progress. However, the performance of existing cutting-edge saliency models relies on large network size and high computational overhead. This is unamiable to real-world applications, especially the practical platforms with low cost and limited computing resources. In this paper, we propose a novel lightweight saliency model, namely Attention-guided Densely Multi-scale Network (ADMNet), to tackle this issue. Firstly, we design the multi-scale perception (MP) module to acquire different contextual features by using different receptive fields. Embarking on MP module, we build the encoder of our model, where each convolutional block adopts a dense structure to connect MP modules. Following this way, our model can provide powerful encoder features for the characterization of salient objects. Secondly, we employ dual attention (DA) module to equip the decoder blocks. Particularly, in DA module, the binarized coarse saliency inference of the decoder block (\n<italic>i.e.</i>\n, a hard spatial attention map) is first employed to filter out interference cues from the decoder feature, and then by introducing large receptive fields, the enhanced decoder feature is used to generate a soft spatial attention map, which further purifies the fused features. Following this way, the deep features are steered to give more concerns to salient regions. Extensive experiments on five public challenging datasets including ECSSD, DUT-OMRON, DUTS-TE, HKU-IS, and PASCAL-S clearly show that our model achieves comparable performance with the state-of-the-art saliency models while running at a 219.4fps GPU speed and a 1.76fps CPU speed for a 368×368 image with only 0.84 M parameters.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10828-10841"},"PeriodicalIF":8.4000,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10555313/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
Recently, benefiting from the rapid development of deep learning, research on salient object detection has made great progress. However, the performance of existing cutting-edge saliency models relies on large network sizes and high computational overhead, which is unfriendly to real-world applications, especially practical platforms with low cost and limited computing resources. In this paper, we propose a novel lightweight saliency model, the Attention-guided Densely Multi-scale Network (ADMNet), to tackle this issue. First, we design a multi-scale perception (MP) module that acquires different contextual features through different receptive fields. Building on the MP module, we construct the encoder of our model, in which each convolutional block connects MP modules in a dense structure. In this way, our model provides powerful encoder features for characterizing salient objects. Second, we equip the decoder blocks with a dual attention (DA) module. In the DA module, the binarized coarse saliency inference of the decoder block (i.e., a hard spatial attention map) is first used to filter interference cues out of the decoder feature; then, by introducing large receptive fields, the enhanced decoder feature is used to generate a soft spatial attention map, which further purifies the fused features. In this way, the deep features are steered to focus on salient regions. Extensive experiments on five challenging public datasets (ECSSD, DUT-OMRON, DUTS-TE, HKU-IS, and PASCAL-S) show that our model achieves performance comparable to state-of-the-art saliency models while running at 219.4 fps on a GPU and 1.76 fps on a CPU for a 368×368 image, with only 0.84M parameters.
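The MP module and the densely connected encoder described above can be illustrated with a short sketch. The following PyTorch code is a minimal reconstruction from the abstract alone: the class names, channel widths, branch count, and dilation rates are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a multi-scale perception (MP) style module: parallel
# branches with different receptive fields whose outputs are fused, plus an
# encoder block that connects MP modules densely. Hyperparameters are assumed.
import torch
import torch.nn as nn

class MPModule(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, dilations=(1, 2, 4)):
        super().__init__()
        branch_ch = out_ch // len(dilations)
        # One 3x3 branch per dilation rate, so each branch sees a
        # different receptive field over the same input.
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        )
        # 1x1 fusion of the concatenated multi-scale features.
        self.fuse = nn.Conv2d(branch_ch * len(dilations), out_ch, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

class DenseMPBlock(nn.Module):
    """Encoder block that densely connects MP modules: each module receives
    the concatenation of the block input and all earlier module outputs."""
    def __init__(self, in_ch: int, growth: int, num_modules: int = 3):
        super().__init__()
        self.mp_modules = nn.ModuleList(
            MPModule(in_ch + i * growth, growth) for i in range(num_modules)
        )

    def forward(self, x):
        feats = [x]
        for m in self.mp_modules:
            feats.append(m(torch.cat(feats, dim=1)))
        return torch.cat(feats[1:], dim=1)
```

As a usage check under these assumptions, DenseMPBlock(32, 32) applied to a (1, 32, 96, 96) tensor yields a (1, 96, 96, 96) feature map: three densely connected MP modules each contribute 32 channels while preserving spatial resolution.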
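The dual attention mechanism can be sketched in the same spirit. The code below is a hedged reconstruction of the DA idea as stated in the abstract: a binarized coarse saliency map acts as a hard spatial attention that filters the decoder feature, and a large-receptive-field branch on the enhanced feature produces a soft spatial attention that purifies the fused features. The kernel size, threshold, and residual connection are assumptions, not the paper's exact design.

```python
# Minimal sketch of a dual attention (DA) style module. Hard attention comes
# from binarizing the coarse saliency inference; soft attention comes from a
# large-kernel branch over the hard-filtered decoder feature.
import torch
import torch.nn as nn

class DAModule(nn.Module):
    def __init__(self, ch: int, large_kernel: int = 7, threshold: float = 0.5):
        super().__init__()
        self.threshold = threshold
        # Large kernel enlarges the receptive field for the soft attention map.
        self.soft_att = nn.Sequential(
            nn.Conv2d(ch, ch, large_kernel, padding=large_kernel // 2, bias=False),
            nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, dec_feat, coarse_saliency, fused_feat):
        # Hard spatial attention: binarize the coarse saliency inference and
        # use it to suppress interference cues in the decoder feature.
        hard_mask = (torch.sigmoid(coarse_saliency) > self.threshold).float()
        enhanced = dec_feat * hard_mask
        # Soft spatial attention from the enhanced feature further purifies
        # the fused features, steering them toward salient regions.
        soft_mask = self.soft_att(enhanced)
        return fused_feat * soft_mask + fused_feat  # residual keeps context
```

The residual term is a design choice assumed here so that the soft mask re-weights rather than zeroes out the fused features, preserving background context that later decoder stages may still need.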
Journal Introduction
The IEEE Transactions on Multimedia covers diverse aspects of multimedia technology and applications, including circuits, networking, signal processing, systems, software, and systems integration. Its scope aligns with the Fields of Interest of the sponsors, ensuring comprehensive coverage of multimedia research.