Shenglun Chen;Xinzhu Ma;Hong Zhang;Haojie Li;Baoli Sun;Zhihui Wang
{"title":"实时深度完成与多模态特征对齐","authors":"Shenglun Chen;Xinzhu Ma;Hong Zhang;Haojie Li;Baoli Sun;Zhihui Wang","doi":"10.1109/TNNLS.2025.3551903","DOIUrl":null,"url":null,"abstract":"As a key problem in computer vision, depth completion aims to recover dense depth maps from sparse ones [generally derived from light detection and ranging (LiDAR)]. Most methods introduce synchronous RGB images and leverage multimodal fusion to integrate multimodal features from these modalities to describe the complete scene. However, their different natural characteristics lead to inconsistency in features, potentially impacting the effectiveness of multimodal feature fusion. To address this issue, we propose a feature alignment network (FANet) that introduces an alignment scheme to enhance the consistency between multimodal features. This scheme aligns the modality-invariant semantic context, which is invariant to changes in modality and represents the correlation between a pixel and its surroundings. Specifically, we first design an asymmetric context extraction (ACE) module to extract modality-invariant semantic contexts from multimodal features within limited GPU memory, and then pull them closer to improve consistency. Crucially, our alignment scheme is only applied during the training phase, and no additional computation cost is incurred in the inference phase. Moreover, we introduce a simple yet effective refinement module to refine estimated results via residual learning based on intermediate depth maps and sparse depth maps. Extensive experiments on KITTI and VOID datasets demonstrate that our method achieves competitive performance against typical real-time methods. In addition, we embed the proposed alignment scheme and refinement module into other methods to demonstrate their effectiveness.","PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"36 9","pages":"16100-16112"},"PeriodicalIF":8.9000,"publicationDate":"2025-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Real-Time Depth Completion With Multimodal Feature Alignment\",\"authors\":\"Shenglun Chen;Xinzhu Ma;Hong Zhang;Haojie Li;Baoli Sun;Zhihui Wang\",\"doi\":\"10.1109/TNNLS.2025.3551903\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As a key problem in computer vision, depth completion aims to recover dense depth maps from sparse ones [generally derived from light detection and ranging (LiDAR)]. Most methods introduce synchronous RGB images and leverage multimodal fusion to integrate multimodal features from these modalities to describe the complete scene. However, their different natural characteristics lead to inconsistency in features, potentially impacting the effectiveness of multimodal feature fusion. To address this issue, we propose a feature alignment network (FANet) that introduces an alignment scheme to enhance the consistency between multimodal features. This scheme aligns the modality-invariant semantic context, which is invariant to changes in modality and represents the correlation between a pixel and its surroundings. Specifically, we first design an asymmetric context extraction (ACE) module to extract modality-invariant semantic contexts from multimodal features within limited GPU memory, and then pull them closer to improve consistency. 
Crucially, our alignment scheme is only applied during the training phase, and no additional computation cost is incurred in the inference phase. Moreover, we introduce a simple yet effective refinement module to refine estimated results via residual learning based on intermediate depth maps and sparse depth maps. Extensive experiments on KITTI and VOID datasets demonstrate that our method achieves competitive performance against typical real-time methods. In addition, we embed the proposed alignment scheme and refinement module into other methods to demonstrate their effectiveness.\",\"PeriodicalId\":13303,\"journal\":{\"name\":\"IEEE transactions on neural networks and learning systems\",\"volume\":\"36 9\",\"pages\":\"16100-16112\"},\"PeriodicalIF\":8.9000,\"publicationDate\":\"2025-04-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on neural networks and learning systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10950123/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on neural networks and learning systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10950123/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Real-Time Depth Completion With Multimodal Feature Alignment
As a key problem in computer vision, depth completion aims to recover dense depth maps from sparse ones, generally derived from light detection and ranging (LiDAR). Most methods introduce synchronized RGB images and leverage multimodal fusion to integrate features from both modalities to describe the complete scene. However, the differing natural characteristics of the two modalities lead to inconsistent features, potentially limiting the effectiveness of multimodal feature fusion. To address this issue, we propose a feature alignment network (FANet) that introduces an alignment scheme to enhance the consistency between multimodal features. This scheme aligns the modality-invariant semantic context, which is invariant to changes in modality and represents the correlation between a pixel and its surroundings. Specifically, we first design an asymmetric context extraction (ACE) module to extract modality-invariant semantic contexts from multimodal features within limited GPU memory, and then pull them closer to improve consistency. Crucially, the alignment scheme is applied only during the training phase, so no additional computation cost is incurred at inference. Moreover, we introduce a simple yet effective refinement module that refines estimated results via residual learning based on intermediate depth maps and sparse depth maps. Extensive experiments on the KITTI and VOID datasets demonstrate that our method achieves competitive performance against typical real-time methods. In addition, we embed the proposed alignment scheme and refinement module into other methods to demonstrate their effectiveness.
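To make the two central ideas of the abstract concrete, below is a minimal PyTorch sketch of (1) a training-only alignment loss that pulls the modality-invariant semantic contexts of the RGB and depth branches closer, and (2) a refinement step that corrects an intermediate dense depth map via residual learning from the sparse input. All module names, tensor shapes, and the choice of an L2 alignment loss are illustrative assumptions inferred from the abstract, not the paper's actual implementation.

```python
# Illustrative sketch only: module names, shapes, and loss choices are
# assumptions based on the abstract, not FANet's published implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextExtractor(nn.Module):
    """Stand-in for the ACE module: summarizes how each pixel relates to
    its surroundings into a compact, modality-invariant context map."""

    def __init__(self, channels: int, context_dim: int = 64):
        super().__init__()
        self.proj = nn.Conv2d(channels, context_dim, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        ctx = self.proj(feat)
        # Unit-normalize per pixel so the alignment distance is well scaled.
        return F.normalize(ctx, dim=1)


def alignment_loss(rgb_feat: torch.Tensor,
                   depth_feat: torch.Tensor,
                   extractor: ContextExtractor) -> torch.Tensor:
    """Training-only loss pulling the two modalities' contexts closer.
    Skipped at inference, so it adds no runtime cost there."""
    ctx_rgb = extractor(rgb_feat)
    ctx_depth = extractor(depth_feat)
    return F.mse_loss(ctx_rgb, ctx_depth)


class ResidualRefiner(nn.Module):
    """Refines an intermediate dense depth map by predicting a residual
    from the intermediate estimate and the sparse LiDAR depth."""

    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, intermediate_depth: torch.Tensor,
                sparse_depth: torch.Tensor) -> torch.Tensor:
        residual = self.net(torch.cat([intermediate_depth, sparse_depth], dim=1))
        return intermediate_depth + residual  # residual learning
```

During training, the total objective would plausibly combine a task loss on the refined depth with a weighted alignment term, e.g. `loss = task_loss + lam * alignment_loss(rgb_feat, depth_feat, extractor)`; at inference only the fusion backbone and the refiner run, which is consistent with the abstract's claim that alignment incurs no extra inference cost.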
Journal Introduction:
The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.