{"title":"RGB-D室内场景分割的级联注意力增强网络","authors":"Xu Tang , Songyang Cen , Zhanhao Deng , Zejun Zhang , Yan Meng , Jianxiao Xie , Changbing Tang , Weichuan Zhang , Guanghui Zhao","doi":"10.1016/j.cviu.2025.104411","DOIUrl":null,"url":null,"abstract":"<div><div>Convolutional neural network based Red, Green, Blue, and Depth (RGB-D) image semantic segmentation for indoor scenes has attracted increasing attention, because of its great potentiality of extracting semantic information from RGB-D images. However, the challenge it brings lies in how to effectively fuse features from RGB and depth images within the neural network architecture. The technical approach of feature aggregation has evolved from the early integration of RGB color images and depth images to the current cross-attention fusion, which enables the features of different RGB channels to be fully integrated with ones of the depth image. However, noises and useless feature for segmentation are inevitably propagated between feature layers during the period of feature aggregation, thereby affecting the accuracy of segmentation results. In this paper, for indoor scenes, a cascading attention enhancement network (CAENet) is proposed with the aim of progressively refining the semantic features of RGB and depth images layer by layer, consisting of four modules: a channel enhancement module (CEM), an adaptive aggregation of spatial attention (AASA), an adaptive aggregation of channel attention (AACA), and a triple-path fusion module (TFM). In encoding stage, CEM complements RGB features with depth features at the end of each layer, in order to effectively revise RGB features for the next layer. At the end of encoding stage, AASA module combines low-level and high-level RGB semantic features by their spatial attention, and AACA module fuses low-level and high-level depth semantic features by their channel attention. The combined RGB and depth semantic features are fused into one and fed into the decoding stage, which consists of triple-path fusion modules (TFMs) combining low-level RGB and depth semantic features and decoded high-level semantic features. The TFM outputs multi-scale feature maps that encapsulate both rich semantic information and fine-grained details, thereby augmenting the model’s capacity for accurate per-pixel semantic label prediction. The proposed CAENet achieves mIoU of 52.0% on NYUDv2 and 48.3% on SUNRGB-D datasets, outperforming recent RGB-D segmentation methods.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104411"},"PeriodicalIF":3.5000,"publicationDate":"2025-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Cascading attention enhancement network for RGB-D indoor scene segmentation\",\"authors\":\"Xu Tang , Songyang Cen , Zhanhao Deng , Zejun Zhang , Yan Meng , Jianxiao Xie , Changbing Tang , Weichuan Zhang , Guanghui Zhao\",\"doi\":\"10.1016/j.cviu.2025.104411\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Convolutional neural network based Red, Green, Blue, and Depth (RGB-D) image semantic segmentation for indoor scenes has attracted increasing attention, because of its great potentiality of extracting semantic information from RGB-D images. However, the challenge it brings lies in how to effectively fuse features from RGB and depth images within the neural network architecture. 
The technical approach of feature aggregation has evolved from the early integration of RGB color images and depth images to the current cross-attention fusion, which enables the features of different RGB channels to be fully integrated with ones of the depth image. However, noises and useless feature for segmentation are inevitably propagated between feature layers during the period of feature aggregation, thereby affecting the accuracy of segmentation results. In this paper, for indoor scenes, a cascading attention enhancement network (CAENet) is proposed with the aim of progressively refining the semantic features of RGB and depth images layer by layer, consisting of four modules: a channel enhancement module (CEM), an adaptive aggregation of spatial attention (AASA), an adaptive aggregation of channel attention (AACA), and a triple-path fusion module (TFM). In encoding stage, CEM complements RGB features with depth features at the end of each layer, in order to effectively revise RGB features for the next layer. At the end of encoding stage, AASA module combines low-level and high-level RGB semantic features by their spatial attention, and AACA module fuses low-level and high-level depth semantic features by their channel attention. The combined RGB and depth semantic features are fused into one and fed into the decoding stage, which consists of triple-path fusion modules (TFMs) combining low-level RGB and depth semantic features and decoded high-level semantic features. The TFM outputs multi-scale feature maps that encapsulate both rich semantic information and fine-grained details, thereby augmenting the model’s capacity for accurate per-pixel semantic label prediction. The proposed CAENet achieves mIoU of 52.0% on NYUDv2 and 48.3% on SUNRGB-D datasets, outperforming recent RGB-D segmentation methods.</div></div>\",\"PeriodicalId\":50633,\"journal\":{\"name\":\"Computer Vision and Image Understanding\",\"volume\":\"259 \",\"pages\":\"Article 104411\"},\"PeriodicalIF\":3.5000,\"publicationDate\":\"2025-06-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Vision and Image Understanding\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1077314225001341\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314225001341","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Convolutional neural network (CNN)-based semantic segmentation of Red, Green, Blue, and Depth (RGB-D) images of indoor scenes has attracted increasing attention because of its great potential for extracting semantic information from RGB-D data. A key challenge, however, lies in how to effectively fuse features from the RGB and depth images within the neural network architecture. The technical approach to feature aggregation has evolved from the early integration of RGB color images and depth images to the current cross-attention fusion, which allows the features of the different RGB channels to be fully integrated with those of the depth image. During feature aggregation, however, noise and features useless for segmentation inevitably propagate between feature layers, degrading the accuracy of the segmentation results. This paper proposes a cascading attention enhancement network (CAENet) for indoor scenes that progressively refines the semantic features of RGB and depth images layer by layer. CAENet consists of four modules: a channel enhancement module (CEM), an adaptive aggregation of spatial attention (AASA) module, an adaptive aggregation of channel attention (AACA) module, and a triple-path fusion module (TFM). In the encoding stage, the CEM complements the RGB features with depth features at the end of each layer, effectively revising the RGB features for the next layer. At the end of the encoding stage, the AASA module combines low-level and high-level RGB semantic features through their spatial attention, and the AACA module fuses low-level and high-level depth semantic features through their channel attention. The combined RGB and depth semantic features are then fused into one representation and fed into the decoding stage, which consists of TFMs that combine low-level RGB and depth semantic features with decoded high-level semantic features. Each TFM outputs multi-scale feature maps that encapsulate both rich semantic information and fine-grained detail, strengthening the model's capacity for accurate per-pixel semantic label prediction. The proposed CAENet achieves an mIoU of 52.0% on the NYUDv2 dataset and 48.3% on the SUNRGB-D dataset, outperforming recent RGB-D segmentation methods.
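This page reproduces only the abstract, not the authors' code. As a rough illustration of the CEM-style step described above, the following minimal PyTorch sketch (all class, parameter, and tensor names are our own assumptions, not the paper's implementation) reweights depth features with a channel-attention gate and adds them to the RGB stream, so that the revised RGB features can feed the next encoder layer:

import torch
import torch.nn as nn

class ChannelEnhancementSketch(nn.Module):
    # Hypothetical CEM-style fusion: complement RGB features with
    # channel-reweighted depth features at the end of an encoder layer.
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Squeeze-and-excitation style channel gate computed from the
        # depth features (global average pool -> bottleneck -> sigmoid).
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # rgb, depth: (N, C, H, W) feature maps from parallel encoders.
        # The gated depth features complement (revise) the RGB stream.
        return rgb + depth * self.gate(depth)

cem = ChannelEnhancementSketch(channels=64)
rgb = torch.randn(1, 64, 120, 160)
depth = torch.randn(1, 64, 120, 160)
print(cem(rgb, depth).shape)  # torch.Size([1, 64, 120, 160])

Under the same assumptions, the AASA and AACA modules would follow an analogous gating pattern, applying spatial attention across low-level and high-level RGB features and channel attention across low-level and high-level depth features, respectively.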
Journal introduction:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems