CAFNet: Context aligned fusion for depth completion

IF 4.3 | CAS Tier 3 (Computer Science) | JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Zhichao Fu, Anran Wu, Shuwen Yang, Tianlong Ma, Liang He
{"title":"CAFNet:上下文对齐融合,实现深度补全","authors":"Zhichao Fu,&nbsp;Anran Wu,&nbsp;Shuwen Yang,&nbsp;Tianlong Ma,&nbsp;Liang He","doi":"10.1016/j.cviu.2024.104158","DOIUrl":null,"url":null,"abstract":"<div><p>Depth completion aims at reconstructing a dense depth from sparse depth input, frequently using color images as guidance. The sparse depth map lacks sufficient contexts for reconstructing focal contexts such as the shape of objects. The RGB images contain redundant contexts including details useless for reconstruction, which reduces the efficiency of focal context extraction. The unaligned contextual information from these two modalities poses a challenge to focal context extraction and further fusion, as well as the accuracy of depth completion. To optimize the utilization of multimodal contextual information, we explore a novel framework: Context Aligned Fusion Network (CAFNet). CAFNet comprises two stages: the context-aligned stage and the full-scale stage. In the context-aligned stage, CAFNet downsamples input RGB-D pairs to the scale, at which multimodal contextual information is adequately aligned for feature extraction in two encoders and fusion in CF modules. In the full-scale stage, feature maps with fused multimodal context from the previous stage are upsampled to the original scale and subsequentially fused with full-scale depth features by the GF module utilizing a dynamic masked fusion strategy. Ultimately, accurate dense depth maps are reconstructed, leveraging the GF module’s resultant features. Experiments conducted on indoor and outdoor benchmark datasets show that the CAFNet produces results comparable to state-of-the-art methods while effectively reducing computational costs.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104158"},"PeriodicalIF":4.3000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CAFNet: Context aligned fusion for depth completion\",\"authors\":\"Zhichao Fu,&nbsp;Anran Wu,&nbsp;Shuwen Yang,&nbsp;Tianlong Ma,&nbsp;Liang He\",\"doi\":\"10.1016/j.cviu.2024.104158\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Depth completion aims at reconstructing a dense depth from sparse depth input, frequently using color images as guidance. The sparse depth map lacks sufficient contexts for reconstructing focal contexts such as the shape of objects. The RGB images contain redundant contexts including details useless for reconstruction, which reduces the efficiency of focal context extraction. The unaligned contextual information from these two modalities poses a challenge to focal context extraction and further fusion, as well as the accuracy of depth completion. To optimize the utilization of multimodal contextual information, we explore a novel framework: Context Aligned Fusion Network (CAFNet). CAFNet comprises two stages: the context-aligned stage and the full-scale stage. In the context-aligned stage, CAFNet downsamples input RGB-D pairs to the scale, at which multimodal contextual information is adequately aligned for feature extraction in two encoders and fusion in CF modules. In the full-scale stage, feature maps with fused multimodal context from the previous stage are upsampled to the original scale and subsequentially fused with full-scale depth features by the GF module utilizing a dynamic masked fusion strategy. 
Ultimately, accurate dense depth maps are reconstructed, leveraging the GF module’s resultant features. Experiments conducted on indoor and outdoor benchmark datasets show that the CAFNet produces results comparable to state-of-the-art methods while effectively reducing computational costs.</p></div>\",\"PeriodicalId\":50633,\"journal\":{\"name\":\"Computer Vision and Image Understanding\",\"volume\":\"249 \",\"pages\":\"Article 104158\"},\"PeriodicalIF\":4.3000,\"publicationDate\":\"2024-09-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Vision and Image Understanding\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S107731422400239X\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S107731422400239X","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract


Depth completion aims at reconstructing a dense depth map from sparse depth input, frequently using color images as guidance. The sparse depth map lacks sufficient context for reconstructing focal context such as the shape of objects. The RGB images contain redundant context, including details useless for reconstruction, which reduces the efficiency of focal context extraction. The unaligned contextual information from these two modalities poses a challenge to focal context extraction and further fusion, as well as to the accuracy of depth completion. To optimize the utilization of multimodal contextual information, we explore a novel framework: the Context Aligned Fusion Network (CAFNet). CAFNet comprises two stages: the context-aligned stage and the full-scale stage. In the context-aligned stage, CAFNet downsamples the input RGB-D pairs to a scale at which multimodal contextual information is adequately aligned for feature extraction in two encoders and fusion in CF modules. In the full-scale stage, feature maps carrying the fused multimodal context from the previous stage are upsampled to the original scale and subsequently fused with full-scale depth features by the GF module, which uses a dynamic masked fusion strategy. Ultimately, accurate dense depth maps are reconstructed from the GF module's resultant features. Experiments conducted on indoor and outdoor benchmark datasets show that CAFNet produces results comparable to state-of-the-art methods while effectively reducing computational costs.
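
The dynamic masked fusion described for the GF module can be illustrated with a small gated-fusion sketch. This is a minimal approximation based only on the abstract: the module name GatedFusion, the single-convolution gate, and the tensor names are assumptions for illustration, not the authors' implementation.

```python
# Minimal PyTorch sketch of a dynamic masked (gated) fusion block in the
# spirit of the GF module. All names and the gate design are assumptions.
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Fuse full-scale depth features with upsampled multimodal context
    features using a per-pixel gate predicted from both inputs."""

    def __init__(self, channels: int):
        super().__init__()
        # Predict a per-pixel, per-channel gate in [0, 1] from the
        # concatenation of the two feature maps.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, depth_feat: torch.Tensor, context_feat: torch.Tensor) -> torch.Tensor:
        # depth_feat:   features from the full-scale depth branch
        # context_feat: fused multimodal features upsampled from the
        #               context-aligned stage
        g = self.gate(torch.cat([depth_feat, context_feat], dim=1))
        # Dynamic mask: where the gate is high, rely on the depth features;
        # elsewhere fall back to the aligned multimodal context.
        return g * depth_feat + (1.0 - g) * context_feat


if __name__ == "__main__":
    fuse = GatedFusion(channels=64)
    d = torch.randn(1, 64, 240, 320)   # full-scale depth features
    c = torch.randn(1, 64, 240, 320)   # upsampled context features
    print(fuse(d, c).shape)            # torch.Size([1, 64, 240, 320])
```

The sketch only shows the fusion step; the paper's actual GF module, its mask prediction, and the preceding context-aligned stage may differ from this simplified form.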

Source journal: Computer Vision and Image Understanding (Engineering & Technology - Engineering: Electrical & Electronic)
CiteScore: 7.80
Self-citation rate: 4.40%
Articles per year: 112
Review time: 79 days
Journal description: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views. Research areas include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems