Cascading attention enhancement network for RGB-D indoor scene segmentation

IF 3.5 · CAS Tier 3 (Computer Science) · JCR Q2 · COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Xu Tang, Songyang Cen, Zhanhao Deng, Zejun Zhang, Yan Meng, Jianxiao Xie, Changbing Tang, Weichuan Zhang, Guanghui Zhao
{"title":"Cascading attention enhancement network for RGB-D indoor scene segmentation","authors":"Xu Tang ,&nbsp;Songyang Cen ,&nbsp;Zhanhao Deng ,&nbsp;Zejun Zhang ,&nbsp;Yan Meng ,&nbsp;Jianxiao Xie ,&nbsp;Changbing Tang ,&nbsp;Weichuan Zhang ,&nbsp;Guanghui Zhao","doi":"10.1016/j.cviu.2025.104411","DOIUrl":null,"url":null,"abstract":"<div><div>Convolutional neural network based Red, Green, Blue, and Depth (RGB-D) image semantic segmentation for indoor scenes has attracted increasing attention, because of its great potentiality of extracting semantic information from RGB-D images. However, the challenge it brings lies in how to effectively fuse features from RGB and depth images within the neural network architecture. The technical approach of feature aggregation has evolved from the early integration of RGB color images and depth images to the current cross-attention fusion, which enables the features of different RGB channels to be fully integrated with ones of the depth image. However, noises and useless feature for segmentation are inevitably propagated between feature layers during the period of feature aggregation, thereby affecting the accuracy of segmentation results. In this paper, for indoor scenes, a cascading attention enhancement network (CAENet) is proposed with the aim of progressively refining the semantic features of RGB and depth images layer by layer, consisting of four modules: a channel enhancement module (CEM), an adaptive aggregation of spatial attention (AASA), an adaptive aggregation of channel attention (AACA), and a triple-path fusion module (TFM). In encoding stage, CEM complements RGB features with depth features at the end of each layer, in order to effectively revise RGB features for the next layer. At the end of encoding stage, AASA module combines low-level and high-level RGB semantic features by their spatial attention, and AACA module fuses low-level and high-level depth semantic features by their channel attention. The combined RGB and depth semantic features are fused into one and fed into the decoding stage, which consists of triple-path fusion modules (TFMs) combining low-level RGB and depth semantic features and decoded high-level semantic features. The TFM outputs multi-scale feature maps that encapsulate both rich semantic information and fine-grained details, thereby augmenting the model’s capacity for accurate per-pixel semantic label prediction. The proposed CAENet achieves mIoU of 52.0% on NYUDv2 and 48.3% on SUNRGB-D datasets, outperforming recent RGB-D segmentation methods.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104411"},"PeriodicalIF":3.5000,"publicationDate":"2025-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314225001341","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Convolutional neural network based Red, Green, Blue, and Depth (RGB-D) image semantic segmentation for indoor scenes has attracted increasing attention because of its great potential for extracting semantic information from RGB-D images. A key challenge, however, lies in how to effectively fuse features from RGB and depth images within the neural network architecture. Feature-aggregation techniques have evolved from the early integration of RGB color images and depth images to the current cross-attention fusion, which allows the features of different RGB channels to be fully integrated with those of the depth image. However, noise and features that are useless for segmentation are inevitably propagated between feature layers during aggregation, degrading the accuracy of the segmentation results. In this paper, a cascading attention enhancement network (CAENet) is proposed for indoor scenes, with the aim of progressively refining the semantic features of RGB and depth images layer by layer. It consists of four modules: a channel enhancement module (CEM), an adaptive aggregation of spatial attention (AASA) module, an adaptive aggregation of channel attention (AACA) module, and a triple-path fusion module (TFM). In the encoding stage, the CEM complements RGB features with depth features at the end of each layer, effectively revising the RGB features for the next layer. At the end of the encoding stage, the AASA module combines low-level and high-level RGB semantic features via their spatial attention, and the AACA module fuses low-level and high-level depth semantic features via their channel attention. The combined RGB and depth semantic features are fused into one representation and fed into the decoding stage, which consists of TFMs that combine low-level RGB and depth semantic features with decoded high-level semantic features. The TFM outputs multi-scale feature maps that encapsulate both rich semantic information and fine-grained details, thereby augmenting the model's capacity for accurate per-pixel semantic label prediction. The proposed CAENet achieves an mIoU of 52.0% on the NYUDv2 and 48.3% on the SUNRGB-D datasets, outperforming recent RGB-D segmentation methods.
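The abstract names the attention modules but does not give their formulas. As a rough illustration only, the NumPy sketch below shows what channel-wise gating (in the spirit of CEM/AACA) and spatial gating (in the spirit of AASA) can look like; the function names, tensor shapes, and sigmoid gating are assumptions for illustration, not the paper's actual design.

```python
# Minimal sketch of channel- and spatial-attention gating, assuming
# sigmoid gates over pooled features; the paper's AASA/AACA/CEM
# definitions may differ in detail.
import numpy as np

def channel_attention(feat):
    # feat: (C, H, W). Pool over spatial dims, then gate each channel,
    # in the spirit of AACA's channel-wise weighting.
    pooled = feat.mean(axis=(1, 2))              # (C,)
    weights = 1.0 / (1.0 + np.exp(-pooled))      # sigmoid gate per channel
    return feat * weights[:, None, None]

def spatial_attention(feat):
    # feat: (C, H, W). Pool over channels, then gate each position,
    # in the spirit of AASA's position-wise weighting.
    pooled = feat.mean(axis=0)                   # (H, W)
    weights = 1.0 / (1.0 + np.exp(-pooled))      # sigmoid gate per pixel
    return feat * weights[None, :, :]

def cem_like_fusion(rgb_feat, depth_feat):
    # CEM-style complement (assumed form): refine RGB features with
    # channel-gated depth features before the next encoder layer.
    return rgb_feat + channel_attention(depth_feat)

# Toy usage: one encoder layer's worth of features.
rgb = np.random.randn(64, 30, 40)
depth = np.random.randn(64, 30, 40)
refined_rgb = cem_like_fusion(rgb, depth)
reweighted = spatial_attention(refined_rgb)      # AASA-flavoured reweighting
print(refined_rgb.shape, reweighted.shape)       # (64, 30, 40) (64, 30, 40)
```

The split mirrors the abstract's division of labor: channel gating decides which feature maps to trust (useful for depth, whose channels vary in reliability), while spatial gating decides which image regions to emphasize.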
Source journal
Computer Vision and Image Understanding (Engineering & Technology - Engineering: Electrical & Electronic)
CiteScore: 7.80
Self-citation rate: 4.40%
Articles per year: 112
Review time: 79 days
Journal introduction: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views. Research areas include: • Theory • Early vision • Data structures and representations • Shape • Range • Motion • Matching and recognition • Architecture and languages • Vision systems