Hierarchical Vision Transformer with Channel Attention for RGB-D Image Segmentation

Yali Yang, Yuanping Xu, Chaolong Zhang, Zhijie Xu, Jian Huang
DOI: 10.1145/3532342.3532352
Published in: Proceedings of the 4th International Symposium on Signal Processing Systems, 2022-03-25
Citations: 4

Abstract

Although convolutional neural networks (CNNs) have become the mainstream approach to image processing and have achieved great success over the past decade, their local receptive fields make it difficult for them to capture global, long-range semantic information. Moreover, in some scenes a purely RGB-based model struggles to classify pixels accurately and to segment object edges finely. This study presents a hierarchical vision Transformer model, named Swin-RGB-D, that incorporates and exploits the depth information in depth images to supplement and enhance ambiguous and obscure features in RGB images. In this design, RGB and depth images serve as the two inputs of a two-branch network. The upstream branch applies the Swin Transformer, which can learn global contextual information from RGB images for segmentation; the other branch applies channel attention to the depth image to capture the feature correlations and dependencies between channels and generates a weight matrix. This matrix is then multiplied with the feature maps at each stage of the down-sampling process to perform weighted multi-modal feature extraction. The fused maps are then added to the up-sampled feature maps of the corresponding size, which compensates for the feature distortion introduced during sampling. Experimental results on two benchmark datasets show that the proposed model makes the network more sensitive to edge information.
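The fusion described above (per-channel weights derived from the depth branch, multiplied into the feature maps and added back to the up-sampled maps) can be sketched in a few lines. The abstract does not specify the exact form of the channel-attention block, so this is a minimal NumPy sketch assuming a squeeze-and-excitation-style design; the function names, the bottleneck ratio `r`, and the weight shapes are illustrative assumptions, not details from the paper.

```python
import numpy as np

def channel_attention(feat, w1, w2):
    """SE-style channel attention (assumed form, not from the paper).

    feat: depth-branch feature map of shape (C, H, W).
    w1, w2: bottleneck weights of shapes (C//r, C) and (C, C//r).
    Returns per-channel weights in (0, 1).
    """
    # Squeeze: global average pooling over the spatial dimensions -> (C,)
    z = feat.mean(axis=(1, 2))
    # Excitation: bottleneck MLP with ReLU, then sigmoid gating
    h = np.maximum(0.0, w1 @ z)
    return 1.0 / (1.0 + np.exp(-(w2 @ h)))

def fuse(rgb_feat, depth_feat, weights):
    """Re-weight the depth channels, then add them to the RGB feature map."""
    return rgb_feat + depth_feat * weights[:, None, None]

# Toy example: C=8 channels, 4x4 spatial resolution, bottleneck ratio r=2
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
rgb = rng.standard_normal((C, H, W))
depth = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))

w = channel_attention(depth, w1, w2)   # (8,) per-channel weights in (0, 1)
out = fuse(rgb, depth, w)              # fused map, same shape as rgb
```

In the model this weighting is applied at each down-sampling stage, so the depth branch can suppress or emphasize RGB channels at every resolution before the weighted maps are added back during up-sampling.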