{"title":"RGB-D室内场景分析的多尺度减法和注意力引导网络","authors":"Wen Xie, Heng Liu","doi":"10.1016/j.displa.2025.103188","DOIUrl":null,"url":null,"abstract":"<div><div>RGB-D scene parsing is a fundamental task in computer vision. However, the lower quality of depth images often leads to less accurate feature representations in the depth branch. Additionally, existing multi-level fusion methods in decoders typically use a unified module to merge RGB and depth features, disregarding the unique characteristics of hierarchical features. This indiscriminate approach can degrade segmentation accuracy. Hence, we propose a Multi-Scale Subtraction and Attention-Guided Network (MSANet). Firstly, through a cross-modal fusion module, we fuse RGB and depth features along the horizontal and vertical directions to capture positional information between the two modalities. Then, we use a Spatial Fusion Unit to adaptively enhance depth and RGB features spatially. Furthermore, we analyze the feature differences across various decoder levels and divide them into spatial and semantic branches. In the semantic branch, a high-level cross-modal fusion module extracts deep semantic information from adjacent high-level features through backpropagation, enabling RGB and depth layer reconstruction and mitigating information disparity and hierarchical differences with subtraction operations. In the spatial branch, a low-level cross-modal fusion module leverages spatial attention to enhance regional accuracy and reduce noise. MSANet achieves 52.0% mIoU on the NYU Depth v2 dataset, outperforming the baseline by 5.1%. On the more challenging SUN RGB-D dataset, MSANet achieves 49.0% mIoU. On the ScanNetV2 dataset, MSANet achieves 60.0% mIoU, further validating its effectiveness in complex indoor scenes.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"91 ","pages":"Article 103188"},"PeriodicalIF":3.4000,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi-scale subtraction and attention-guided network for RGB-D indoor scene parsing\",\"authors\":\"Wen Xie, Heng Liu\",\"doi\":\"10.1016/j.displa.2025.103188\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>RGB-D scene parsing is a fundamental task in computer vision. However, the lower quality of depth images often leads to less accurate feature representations in the depth branch. Additionally, existing multi-level fusion methods in decoders typically use a unified module to merge RGB and depth features, disregarding the unique characteristics of hierarchical features. This indiscriminate approach can degrade segmentation accuracy. Hence, we propose a Multi-Scale Subtraction and Attention-Guided Network (MSANet). Firstly, through a cross-modal fusion module, we fuse RGB and depth features along the horizontal and vertical directions to capture positional information between the two modalities. Then, we use a Spatial Fusion Unit to adaptively enhance depth and RGB features spatially. Furthermore, we analyze the feature differences across various decoder levels and divide them into spatial and semantic branches. In the semantic branch, a high-level cross-modal fusion module extracts deep semantic information from adjacent high-level features through backpropagation, enabling RGB and depth layer reconstruction and mitigating information disparity and hierarchical differences with subtraction operations. 
In the spatial branch, a low-level cross-modal fusion module leverages spatial attention to enhance regional accuracy and reduce noise. MSANet achieves 52.0% mIoU on the NYU Depth v2 dataset, outperforming the baseline by 5.1%. On the more challenging SUN RGB-D dataset, MSANet achieves 49.0% mIoU. On the ScanNetV2 dataset, MSANet achieves 60.0% mIoU, further validating its effectiveness in complex indoor scenes.</div></div>\",\"PeriodicalId\":50570,\"journal\":{\"name\":\"Displays\",\"volume\":\"91 \",\"pages\":\"Article 103188\"},\"PeriodicalIF\":3.4000,\"publicationDate\":\"2025-08-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Displays\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0141938225002252\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Displays","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0141938225002252","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Multi-scale subtraction and attention-guided network for RGB-D indoor scene parsing
RGB-D scene parsing is a fundamental task in computer vision. However, the lower quality of depth images often leads to less accurate feature representations in the depth branch. Additionally, existing multi-level fusion methods in decoders typically use a unified module to merge RGB and depth features, disregarding the unique characteristics of hierarchical features. This indiscriminate approach can degrade segmentation accuracy. Hence, we propose a Multi-Scale Subtraction and Attention-Guided Network (MSANet). Firstly, through a cross-modal fusion module, we fuse RGB and depth features along the horizontal and vertical directions to capture positional information between the two modalities. Then, we use a Spatial Fusion Unit to adaptively enhance depth and RGB features spatially. Furthermore, we analyze the feature differences across various decoder levels and divide them into spatial and semantic branches. In the semantic branch, a high-level cross-modal fusion module extracts deep semantic information from adjacent high-level features through backpropagation, enabling RGB and depth layer reconstruction and mitigating information disparity and hierarchical differences with subtraction operations. In the spatial branch, a low-level cross-modal fusion module leverages spatial attention to enhance regional accuracy and reduce noise. MSANet achieves 52.0% mIoU on the NYU Depth v2 dataset, outperforming the baseline by 5.1%. On the more challenging SUN RGB-D dataset, MSANet achieves 49.0% mIoU. On the ScanNetV2 dataset, MSANet achieves 60.0% mIoU, further validating its effectiveness in complex indoor scenes.
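The abstract describes three fusion ideas without implementation detail: a cross-modal fusion module that pools features along the horizontal and vertical directions to capture positional cues, a subtraction-based high-level fusion that exploits differences between adjacent decoder features, and a low-level fusion driven by spatial attention. The PyTorch sketch below is a minimal reconstruction of plausible forms of these components, based only on the abstract; the module names (CrossModalFusion, SubtractionFusion, SpatialAttentionFusion), channel widths, and wiring are illustrative assumptions, not the authors' MSANet code.

# Illustrative sketch of the fusion ideas summarized in the abstract.
# All names and design details are assumptions; this is not the published MSANet implementation.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Blend RGB and depth features using pooling along the H and W directions."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        mid = max(channels // reduction, 8)
        self.squeeze = nn.Sequential(
            nn.Conv2d(2 * channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
        )
        self.attn_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.attn_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb, depth], dim=1)              # (B, 2C, H, W)
        pooled_h = x.mean(dim=3, keepdim=True)          # pool along width  -> (B, 2C, H, 1)
        pooled_w = x.mean(dim=2, keepdim=True)          # pool along height -> (B, 2C, 1, W)
        attn = torch.sigmoid(self.attn_h(self.squeeze(pooled_h))) * \
               torch.sigmoid(self.attn_w(self.squeeze(pooled_w)))  # broadcasts to (B, C, H, W)
        return rgb * attn + depth * (1.0 - attn)        # position-aware blend of the two modalities


class SubtractionFusion(nn.Module):
    """High-level branch: emphasise differences between adjacent decoder features via subtraction."""

    def __init__(self, channels: int):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, high: torch.Tensor, low: torch.Tensor) -> torch.Tensor:
        high_up = nn.functional.interpolate(
            high, size=low.shape[2:], mode="bilinear", align_corners=False)
        diff = torch.abs(high_up - low)                 # subtraction highlights hierarchical disparity
        return low + self.refine(diff)


class SpatialAttentionFusion(nn.Module):
    """Low-level branch: a spatial attention map suppresses noisy regions."""

    def __init__(self):
        super().__init__()
        self.attn = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        x = rgb + depth
        avg_map = x.mean(dim=1, keepdim=True)           # (B, 1, H, W) channel-average descriptor
        max_map = x.amax(dim=1, keepdim=True)           # (B, 1, H, W) channel-max descriptor
        weight = torch.sigmoid(self.attn(torch.cat([avg_map, max_map], dim=1)))
        return x * weight


if __name__ == "__main__":
    rgb = torch.randn(1, 64, 60, 80)
    depth = torch.randn(1, 64, 60, 80)
    fused = CrossModalFusion(64)(rgb, depth)
    print(fused.shape)  # torch.Size([1, 64, 60, 80])

In this sketch the coordinate-style pooling stands in for the paper's horizontal/vertical fusion, the absolute difference plus refinement stands in for its subtraction operation, and the two-channel spatial map stands in for its low-level attention; the actual MSANet modules may differ substantially.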
Journal introduction:
Displays is the international journal covering the research and development of display technology, the effective presentation and perception of information, and applications and systems including the display-human interface.
Technical papers on practical developments in display technology provide an effective channel to promote greater understanding and cross-fertilization across the diverse disciplines of the Displays community. Original research papers solving ergonomics issues at the display-human interface advance the effective presentation of information. Tutorial papers covering fundamentals, intended for display technology and human factors engineers new to the field, will also occasionally be featured.