Shenglun Chen;Xinzhu Ma;Hong Zhang;Haojie Li;Baoli Sun;Zhihui Wang
{"title":"实时深度完成与多模态特征对齐","authors":"Shenglun Chen;Xinzhu Ma;Hong Zhang;Haojie Li;Baoli Sun;Zhihui Wang","doi":"10.1109/TNNLS.2025.3551903","DOIUrl":null,"url":null,"abstract":"As a key problem in computer vision, depth completion aims to recover dense depth maps from sparse ones [generally derived from light detection and ranging (LiDAR)]. Most methods introduce synchronous RGB images and leverage multimodal fusion to integrate multimodal features from these modalities to describe the complete scene. However, their different natural characteristics lead to inconsistency in features, potentially impacting the effectiveness of multimodal feature fusion. To address this issue, we propose a feature alignment network (FANet) that introduces an alignment scheme to enhance the consistency between multimodal features. This scheme aligns the modality-invariant semantic context, which is invariant to changes in modality and represents the correlation between a pixel and its surroundings. Specifically, we first design an asymmetric context extraction (ACE) module to extract modality-invariant semantic contexts from multimodal features within limited GPU memory, and then pull them closer to improve consistency. Crucially, our alignment scheme is only applied during the training phase, and no additional computation cost is incurred in the inference phase. Moreover, we introduce a simple yet effective refinement module to refine estimated results via residual learning based on intermediate depth maps and sparse depth maps. Extensive experiments on KITTI and VOID datasets demonstrate that our method achieves competitive performance against typical real-time methods. In addition, we embed the proposed alignment scheme and refinement module into other methods to demonstrate their effectiveness.","PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"36 9","pages":"16100-16112"},"PeriodicalIF":8.9000,"publicationDate":"2025-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Real-Time Depth Completion With Multimodal Feature Alignment\",\"authors\":\"Shenglun Chen;Xinzhu Ma;Hong Zhang;Haojie Li;Baoli Sun;Zhihui Wang\",\"doi\":\"10.1109/TNNLS.2025.3551903\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As a key problem in computer vision, depth completion aims to recover dense depth maps from sparse ones [generally derived from light detection and ranging (LiDAR)]. Most methods introduce synchronous RGB images and leverage multimodal fusion to integrate multimodal features from these modalities to describe the complete scene. However, their different natural characteristics lead to inconsistency in features, potentially impacting the effectiveness of multimodal feature fusion. To address this issue, we propose a feature alignment network (FANet) that introduces an alignment scheme to enhance the consistency between multimodal features. This scheme aligns the modality-invariant semantic context, which is invariant to changes in modality and represents the correlation between a pixel and its surroundings. Specifically, we first design an asymmetric context extraction (ACE) module to extract modality-invariant semantic contexts from multimodal features within limited GPU memory, and then pull them closer to improve consistency. 
Crucially, our alignment scheme is only applied during the training phase, and no additional computation cost is incurred in the inference phase. Moreover, we introduce a simple yet effective refinement module to refine estimated results via residual learning based on intermediate depth maps and sparse depth maps. Extensive experiments on KITTI and VOID datasets demonstrate that our method achieves competitive performance against typical real-time methods. In addition, we embed the proposed alignment scheme and refinement module into other methods to demonstrate their effectiveness.\",\"PeriodicalId\":13303,\"journal\":{\"name\":\"IEEE transactions on neural networks and learning systems\",\"volume\":\"36 9\",\"pages\":\"16100-16112\"},\"PeriodicalIF\":8.9000,\"publicationDate\":\"2025-04-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on neural networks and learning systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10950123/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on neural networks and learning systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10950123/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Real-Time Depth Completion With Multimodal Feature Alignment
As a key problem in computer vision, depth completion aims to recover dense depth maps from sparse ones, generally derived from light detection and ranging (LiDAR). Most methods introduce synchronized RGB images and leverage multimodal fusion to integrate features from both modalities to describe the complete scene. However, the differing natural characteristics of the two modalities lead to inconsistent features, potentially limiting the effectiveness of multimodal feature fusion. To address this issue, we propose a feature alignment network (FANet) that introduces an alignment scheme to enhance the consistency between multimodal features. This scheme aligns the modality-invariant semantic context, which is invariant to changes in modality and represents the correlation between a pixel and its surroundings. Specifically, we first design an asymmetric context extraction (ACE) module to extract modality-invariant semantic contexts from multimodal features within limited GPU memory, and then pull them closer to improve consistency. Crucially, the alignment scheme is applied only during the training phase, so no additional computation cost is incurred at inference. Moreover, we introduce a simple yet effective refinement module that refines estimated results via residual learning based on intermediate depth maps and sparse depth maps. Extensive experiments on the KITTI and VOID datasets demonstrate that our method achieves competitive performance against typical real-time methods. In addition, we embed the proposed alignment scheme and refinement module into other methods to demonstrate their effectiveness.
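To make the two central ideas of the abstract concrete, below is a minimal PyTorch sketch of (1) a training-only alignment loss that pulls the modality-invariant semantic contexts of the RGB and depth branches closer, and (2) a refinement step that corrects an intermediate dense depth map via residual learning from the sparse input. All module names, tensor shapes, and the choice of an L2 alignment loss are illustrative assumptions inferred from the abstract, not the paper's actual implementation.

```python
# Illustrative sketch only: module names, shapes, and loss choices are
# assumptions based on the abstract, not FANet's published implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextExtractor(nn.Module):
    """Stand-in for the ACE module: summarizes how each pixel relates to
    its surroundings into a compact, modality-invariant context map."""

    def __init__(self, channels: int, context_dim: int = 64):
        super().__init__()
        self.proj = nn.Conv2d(channels, context_dim, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        ctx = self.proj(feat)
        # Unit-normalize per pixel so the alignment distance is well scaled.
        return F.normalize(ctx, dim=1)


def alignment_loss(rgb_feat: torch.Tensor,
                   depth_feat: torch.Tensor,
                   extractor: ContextExtractor) -> torch.Tensor:
    """Training-only loss pulling the two modalities' contexts closer.
    Skipped at inference, so it adds no runtime cost there."""
    ctx_rgb = extractor(rgb_feat)
    ctx_depth = extractor(depth_feat)
    return F.mse_loss(ctx_rgb, ctx_depth)


class ResidualRefiner(nn.Module):
    """Refines an intermediate dense depth map by predicting a residual
    from the intermediate estimate and the sparse LiDAR depth."""

    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, intermediate_depth: torch.Tensor,
                sparse_depth: torch.Tensor) -> torch.Tensor:
        residual = self.net(torch.cat([intermediate_depth, sparse_depth], dim=1))
        return intermediate_depth + residual  # residual learning
```

During training, the total objective would plausibly combine a task loss on the refined depth with a weighted alignment term, e.g. `loss = task_loss + lam * alignment_loss(rgb_feat, depth_feat, extractor)`; at inference only the fusion backbone and the refiner run, which is consistent with the abstract's claim that alignment incurs no extra inference cost.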
Journal Introduction:
The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.