Multi-scale subtraction and attention-guided network for RGB-D indoor scene parsing

Impact Factor 3.4 · Zone 2 (Engineering & Technology) · Q1 (Computer Science, Hardware & Architecture)
Wen Xie, Heng Liu
{"title":"Multi-scale subtraction and attention-guided network for RGB-D indoor scene parsing","authors":"Wen Xie,&nbsp;Heng Liu","doi":"10.1016/j.displa.2025.103188","DOIUrl":null,"url":null,"abstract":"<div><div>RGB-D scene parsing is a fundamental task in computer vision. However, the lower quality of depth images often leads to less accurate feature representations in the depth branch. Additionally, existing multi-level fusion methods in decoders typically use a unified module to merge RGB and depth features, disregarding the unique characteristics of hierarchical features. This indiscriminate approach can degrade segmentation accuracy. Hence, we propose a Multi-Scale Subtraction and Attention-Guided Network (MSANet). Firstly, through a cross-modal fusion module, we fuse RGB and depth features along the horizontal and vertical directions to capture positional information between the two modalities. Then, we use a Spatial Fusion Unit to adaptively enhance depth and RGB features spatially. Furthermore, we analyze the feature differences across various decoder levels and divide them into spatial and semantic branches. In the semantic branch, a high-level cross-modal fusion module extracts deep semantic information from adjacent high-level features through backpropagation, enabling RGB and depth layer reconstruction and mitigating information disparity and hierarchical differences with subtraction operations. In the spatial branch, a low-level cross-modal fusion module leverages spatial attention to enhance regional accuracy and reduce noise. MSANet achieves 52.0% mIoU on the NYU Depth v2 dataset, outperforming the baseline by 5.1%. On the more challenging SUN RGB-D dataset, MSANet achieves 49.0% mIoU. On the ScanNetV2 dataset, MSANet achieves 60.0% mIoU, further validating its effectiveness in complex indoor scenes.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"91 ","pages":"Article 103188"},"PeriodicalIF":3.4000,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Displays","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0141938225002252","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

RGB-D scene parsing is a fundamental task in computer vision. However, the lower quality of depth images often leads to less accurate feature representations in the depth branch. Additionally, existing multi-level fusion methods in decoders typically use a unified module to merge RGB and depth features, disregarding the unique characteristics of hierarchical features. This indiscriminate approach can degrade segmentation accuracy. Hence, we propose a Multi-Scale Subtraction and Attention-Guided Network (MSANet). Firstly, through a cross-modal fusion module, we fuse RGB and depth features along the horizontal and vertical directions to capture positional information between the two modalities. Then, we use a Spatial Fusion Unit to adaptively enhance depth and RGB features spatially. Furthermore, we analyze the feature differences across various decoder levels and divide them into spatial and semantic branches. In the semantic branch, a high-level cross-modal fusion module extracts deep semantic information from adjacent high-level features through backpropagation, enabling RGB and depth layer reconstruction and mitigating information disparity and hierarchical differences with subtraction operations. In the spatial branch, a low-level cross-modal fusion module leverages spatial attention to enhance regional accuracy and reduce noise. MSANet achieves 52.0% mIoU on the NYU Depth v2 dataset, outperforming the baseline by 5.1%. On the more challenging SUN RGB-D dataset, MSANet achieves 49.0% mIoU. On the ScanNetV2 dataset, MSANet achieves 60.0% mIoU, further validating its effectiveness in complex indoor scenes.
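The abstract highlights two mechanisms: fusing RGB and depth features along the horizontal and vertical directions to capture positional information, and subtracting adjacent decoder-level features to expose complementary information. The PyTorch sketch below is a minimal illustration of these two ideas only; the module names (`DirectionalCrossModalFusion`, `SubtractionUnit`), channel sizes, and wiring are hypothetical assumptions and do not reproduce the authors' MSANet implementation.

```python
# Minimal, hypothetical sketch of (1) directional cross-modal fusion and
# (2) a subtraction unit between adjacent feature levels, as described in the
# abstract. Not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DirectionalCrossModalFusion(nn.Module):
    """Fuse RGB and depth features using pooling along the H and W directions."""

    def __init__(self, channels: int):
        super().__init__()
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)
        # Shared 1x1 conv applied to both directional descriptors.
        self.attn = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        x = self.reduce(torch.cat([rgb, depth], dim=1))  # (B, C, H, W)
        h_desc = x.mean(dim=3, keepdim=True)             # pool along W -> (B, C, H, 1)
        w_desc = x.mean(dim=2, keepdim=True)             # pool along H -> (B, C, 1, W)
        # Broadcasting the two directional descriptors yields a gate that
        # retains positional cues along both axes.
        gate = torch.sigmoid(self.attn(h_desc) + self.attn(w_desc))
        return x * gate


class SubtractionUnit(nn.Module):
    """Emphasize differences between adjacent decoder levels."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, high: torch.Tensor, low: torch.Tensor) -> torch.Tensor:
        # Upsample the deeper (lower-resolution) feature map, then subtract to
        # expose complementary information between the two levels.
        high = F.interpolate(high, size=low.shape[2:], mode="bilinear",
                             align_corners=False)
        return self.conv(torch.abs(high - low))


if __name__ == "__main__":
    rgb = torch.randn(1, 64, 60, 80)
    depth = torch.randn(1, 64, 60, 80)
    fused = DirectionalCrossModalFusion(64)(rgb, depth)
    deeper = torch.randn(1, 64, 30, 40)
    diff = SubtractionUnit(64)(deeper, fused)
    print(fused.shape, diff.shape)  # both torch.Size([1, 64, 60, 80])
```

Averaging along one spatial axis at a time preserves positional cues in the other axis, which is one common way to realize the "horizontal and vertical" fusion the abstract refers to; the subtraction unit is likewise only one plausible reading of the multi-scale subtraction idea.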
Source Journal

Displays (Engineering & Technology – Engineering: Electrical & Electronic)

CiteScore: 4.60
Self-citation rate: 25.60%
Articles published: 138
Review time: 92 days

Journal description: Displays is the international journal covering the research and development of display technology, its effective presentation and perception of information, and applications and systems including the display-human interface. Technical papers on practical developments in display technology provide an effective channel to promote greater understanding and cross-fertilization across the diverse disciplines of the Displays community. Original research papers solving ergonomics issues at the display-human interface advance the effective presentation of information. Tutorial papers covering fundamentals, intended for display technology and human factors engineers new to the field, will also occasionally be featured.