FCEGNet: Feature calibration and edge-guided MLP decoder network for RGB-D semantic segmentation

Yiming Lu, Bin Ge, Chenxing Xia, Xu Zhu, Mengge Zhang, Mengya Gao, Ningjie Chen, Jianjun Hu, Junjie Zhi

Computer Vision and Image Understanding, Volume 260, Article 104448 (published 2025-07-17). DOI: 10.1016/j.cviu.2025.104448

Abstract:
Depth images provide rich geometric cues that complement traditional RGB semantic segmentation and effectively improve its performance. However, during feature fusion there are feature biases between RGB features and depth features, which negatively affect cross-modal fusion. In this paper, we propose a novel RGB-D network, FCEGNet, consisting of a Feature Calibration Interaction Module (FCIM), a Three-Stream Fusion Extraction Module (TFEM), and an edge-guided MLP decoder. FCIM processes features at different orientations and scales by balancing features across modalities, and exchanges spatial information so that RGB and depth features are calibrated against, and interact with, cross-modal features. TFEM extracts cross-modal features and combines them with unimodal features to improve semantic understanding and fine-grained recognition. A Dual-stream Edge Guidance Module (DEGM) in the edge-guided MLP decoder preserves the consistency and disparity of cross-modal features while enhancing edge information and retaining spatial information, which helps to obtain more accurate segmentation results. Experimental results on RGB-D datasets show that the proposed FCEGNet is more accurate and efficient than several state-of-the-art methods. Generalization experiments on an RGB-T semantic segmentation dataset also achieve strong results.
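The paper itself does not give equations for FCIM, but the idea the abstract describes — balancing RGB and depth features against each other before fusion so that modality-specific biases are suppressed — can be illustrated with a minimal NumPy sketch. Everything here (the function name `calibrate_features`, the sigmoid gating, the additive fusion) is a hypothetical simplification for illustration, not the authors' actual module:

```python
import numpy as np

def calibrate_features(rgb, depth):
    """Hypothetical sketch of cross-modal feature calibration.

    rgb, depth: (C, H, W) feature maps from the two encoder streams.
    Each modality is re-weighted channel-wise using global statistics
    pooled from the *other* modality, so biased channels in one stream
    are balanced against the other before fusion.
    """
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Global average pooling over spatial dims -> (C,) channel descriptors.
    g_rgb = rgb.mean(axis=(1, 2))
    g_depth = depth.mean(axis=(1, 2))

    # Cross-modal gates: each stream is gated by the other's statistics.
    w_rgb = sigmoid(g_depth)[:, None, None]
    w_depth = sigmoid(g_rgb)[:, None, None]

    rgb_cal = rgb * w_rgb
    depth_cal = depth * w_depth

    # Simple additive fusion of the calibrated features.
    fused = rgb_cal + depth_cal
    return rgb_cal, depth_cal, fused

rgb = np.ones((4, 8, 8))
depth = np.zeros((4, 8, 8))
rgb_cal, depth_cal, fused = calibrate_features(rgb, depth)
print(fused.shape)  # (4, 8, 8)
```

In a real network these gates would be learned (e.g. small MLPs after the pooling step) rather than a bare sigmoid, but the data flow — pool, exchange statistics across modalities, re-weight, fuse — is the pattern the abstract describes.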
Journal Introduction:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems