{"title":"Lightweight RGB-D Salient Object Detection From a Speed-Accuracy Tradeoff Perspective","authors":"Songsong Duan;Xi Yang;Nannan Wang;Xinbo Gao","doi":"10.1109/TIP.2025.3560488","DOIUrl":null,"url":null,"abstract":"Current RGB-D methods usually leverage large-scale backbones to improve accuracy but sacrifice efficiency. Meanwhile, several existing lightweight methods are difficult to achieve high-precision performance. To balance the efficiency and performance, we propose a Speed-Accuracy Tradeoff Network (SATNet) for Lightweight RGB-D SOD from three fundamental perspectives: depth quality, modality fusion, and feature representation. Concerning depth quality, we introduce the Depth Anything Model to generate high-quality depth maps,which effectively alleviates the multi-modal gaps in the current datasets. For modality fusion, we propose a Decoupled Attention Module (DAM) to explore the consistency within and between modalities. Here, the multi-modal features are decoupled into dual-view feature vectors to project discriminable information of feature maps. For feature representation, we develop a Dual Information Representation Module (DIRM) with a bi-directional inverted framework to enlarge the limited feature space generated by the lightweight backbones. DIRM models texture features and saliency features to enrich feature space, and employ two-way prediction heads to optimal its parameters through a bi-directional backpropagation. Finally, we design a Dual Feature Aggregation Module (DFAM) in the decoder to aggregate texture and saliency features. Extensive experiments on five public RGB-D SOD datasets indicate that the proposed SATNet excels state-of-the-art (SOTA) CNN-based heavyweight models and achieves a lightweight framework with 5.2 M parameters and 415 FPS. The code is available at <uri>https://github.com/duan-song/SATNet</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2529-2543"},"PeriodicalIF":0.0000,"publicationDate":"2025-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10970446/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Current RGB-D methods usually leverage large-scale backbones to improve accuracy but sacrifice efficiency, while several existing lightweight methods struggle to reach high-precision performance. To balance efficiency and accuracy, we propose a Speed-Accuracy Tradeoff Network (SATNet) for lightweight RGB-D SOD that addresses three fundamental perspectives: depth quality, modality fusion, and feature representation. Concerning depth quality, we introduce the Depth Anything Model to generate high-quality depth maps, which effectively alleviates the multi-modal gaps in current datasets. For modality fusion, we propose a Decoupled Attention Module (DAM) to explore the consistency within and between modalities; here, the multi-modal features are decoupled into dual-view feature vectors that project the discriminable information of the feature maps. For feature representation, we develop a Dual Information Representation Module (DIRM) with a bi-directional inverted framework to enlarge the limited feature space produced by lightweight backbones. DIRM models texture features and saliency features to enrich the feature space, and employs two-way prediction heads to optimize its parameters through bi-directional backpropagation. Finally, we design a Dual Feature Aggregation Module (DFAM) in the decoder to aggregate the texture and saliency features. Extensive experiments on five public RGB-D SOD datasets show that the proposed SATNet surpasses state-of-the-art (SOTA) CNN-based heavyweight models while remaining lightweight, with 5.2 M parameters and 415 FPS. The code is available at https://github.com/duan-song/SATNet.
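
For the depth-quality step, the abstract describes regenerating depth maps with the Depth Anything Model. Below is a minimal sketch of how such regeneration could be done with the Hugging Face depth-estimation pipeline; the model id, file names, and output handling are assumptions for illustration, not the paper's exact preprocessing.

```python
# Sketch: regenerate a dataset's depth map from its RGB image with Depth Anything.
# Assumed model id "LiheYoung/depth-anything-small-hf"; the paper may use a
# different variant or post-processing.
from transformers import pipeline
from PIL import Image

depth_estimator = pipeline(
    "depth-estimation", model="LiheYoung/depth-anything-small-hf"
)

image = Image.open("rgb_sample.jpg")      # RGB image from an RGB-D SOD dataset
result = depth_estimator(image)           # dict with "depth" (PIL) and "predicted_depth" (tensor)
result["depth"].save("depth_sample.png")  # replace the dataset's low-quality depth map
```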
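
The DAM's core idea is to decouple each modality's feature map into compact dual-view vectors (e.g., a channel-wise vector and a spatial map) and use them to enforce intra- and inter-modal consistency. The sketch below is one plausible reading under stated assumptions: the pooling choices, the shared attention projections, and the additive fusion rule are illustrative, not the paper's DAM definition.

```python
# A minimal sketch of decoupled attention for RGB-D fusion (assumed design).
import torch
import torch.nn as nn

class DecoupledAttentionSketch(nn.Module):
    """Fuses RGB and depth features via decoupled channel/spatial views."""
    def __init__(self, channels: int):
        super().__init__()
        # Channel view: joint MLP over both modalities' pooled vectors.
        self.channel_mlp = nn.Sequential(
            nn.Linear(2 * channels, channels), nn.ReLU(inplace=True),
            nn.Linear(channels, channels), nn.Sigmoid(),
        )
        # Spatial view: two stacked channel-mean maps -> one attention map.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, f_rgb: torch.Tensor, f_depth: torch.Tensor) -> torch.Tensor:
        # Channel view: global-average-pool each modality to a C-dim vector,
        # then predict joint channel weights (cross-modal consistency).
        c_rgb = f_rgb.mean(dim=(2, 3))                                  # (B, C)
        c_depth = f_depth.mean(dim=(2, 3))                              # (B, C)
        ch_att = self.channel_mlp(torch.cat([c_rgb, c_depth], dim=1))   # (B, C)
        ch_att = ch_att[:, :, None, None]
        # Spatial view: channel-mean maps from both modalities -> shared map.
        s_rgb = f_rgb.mean(dim=1, keepdim=True)                         # (B, 1, H, W)
        s_depth = f_depth.mean(dim=1, keepdim=True)                     # (B, 1, H, W)
        sp_att = self.spatial_conv(torch.cat([s_rgb, s_depth], dim=1))  # (B, 1, H, W)
        # Re-weight and merge the two modalities.
        return (f_rgb + f_depth) * ch_att * sp_att

dam = DecoupledAttentionSketch(channels=64)
fused = dam(torch.randn(2, 64, 40, 40), torch.randn(2, 64, 40, 40))
print(fused.shape)  # torch.Size([2, 64, 40, 40])
```

Decoupling into vectors keeps the attention cheap (a C-dim MLP and a single-channel conv), which is consistent with the lightweight design goal.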
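
For DIRM and DFAM, the abstract describes two feature streams (texture and saliency), each with its own prediction head so that supervision flows back from both directions, plus a decoder-side module that aggregates the two streams. The following is a minimal sketch of that dual-branch, two-head pattern; the branch widths, 3x3 convolutions, aggregation block, and loss weighting are assumptions, not the paper's DIRM/DFAM design.

```python
# Sketch: texture/saliency branches with two-way heads and a simple aggregator.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchSketch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.texture = nn.Conv2d(channels, channels, 3, padding=1)   # texture features
        self.saliency = nn.Conv2d(channels, channels, 3, padding=1)  # saliency features
        self.texture_head = nn.Conv2d(channels, 1, 1)                # auxiliary texture head
        self.saliency_head = nn.Conv2d(channels, 1, 1)               # auxiliary saliency head
        self.aggregate = nn.Conv2d(2 * channels, 1, 3, padding=1)    # decoder-side fusion

    def forward(self, x: torch.Tensor):
        t, s = self.texture(x), self.saliency(x)
        # Two-way heads: each branch gets its own supervision, so gradients
        # flow back into the shared features from both directions.
        pred_t = self.texture_head(t)
        pred_s = self.saliency_head(s)
        pred = self.aggregate(torch.cat([t, s], dim=1))  # fused final prediction
        return pred, pred_t, pred_s

model = DualBranchSketch(channels=64)
pred, pred_t, pred_s = model(torch.randn(2, 64, 40, 40))
# Joint loss: the two auxiliary terms realize the bi-directional supervision
# (ground-truth tensors here are random placeholders).
gt_sal, gt_edge = torch.rand(2, 1, 40, 40), torch.rand(2, 1, 40, 40)
loss = (F.binary_cross_entropy_with_logits(pred, gt_sal)
        + F.binary_cross_entropy_with_logits(pred_s, gt_sal)
        + F.binary_cross_entropy_with_logits(pred_t, gt_edge))
```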