DepthFormer: Exploiting Long-range Correlation and Local Information for Accurate Monocular Depth Estimation

Zhenyu Li, Zehui Chen, Xianming Liu, Junjun Jiang

Journal: Machine Intelligence Research (IF 6.4 · CAS Tier 4, Computer Science · JCR Q1, Automation & Control Systems)
DOI: 10.1007/s11633-023-1458-0
Published: 2023-09-13
Citations: 63

Abstract

This paper addresses the problem of supervised monocular depth estimation. We start with a careful pilot study demonstrating that long-range correlation is essential for accurate depth estimation, and that the Transformer and convolution are good at long-range and close-range depth estimation, respectively. We therefore adopt a parallel encoder architecture consisting of a Transformer branch and a convolution branch. The former models global context with an effective attention mechanism, while the latter preserves local information, since the Transformer lacks the spatial inductive bias needed to model such content. However, independent branches leave the two sets of features poorly connected. To bridge this gap, we design a hierarchical aggregation and heterogeneous interaction module that enhances the Transformer features and models the affinity between the heterogeneous features in a set-to-set translation manner. Because global attention on high-resolution feature maps incurs an unaffordable memory cost, we adopt a deformable scheme to reduce the complexity. Extensive experiments on the KITTI, NYU, and SUN RGB-D datasets demonstrate that our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods by prominent margins. The effectiveness of each proposed module is evaluated through thorough ablation studies.
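The abstract outlines the core architectural idea: a convolution branch that preserves local detail runs in parallel with a Transformer branch that models long-range context, and an interaction module fuses the two heterogeneous feature sets. The PyTorch sketch below illustrates that parallel-encoder pattern under simplifying assumptions; the module names, dimensions, and the plain cross-attention used for fusion are illustrative stand-ins, not the authors' DepthFormer implementation (which uses hierarchical aggregation and deformable attention to keep the cost manageable).

```python
# Minimal sketch of a parallel Transformer + convolution encoder with a
# set-to-set interaction step, as described in the abstract. All names,
# sizes, and the cross-attention fusion are illustrative assumptions.
import torch
import torch.nn as nn


class ConvBranch(nn.Module):
    """Lightweight convolutional branch that preserves local spatial detail."""
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=1),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.stem(x)  # (B, dim, H/4, W/4)


class TransformerBranch(nn.Module):
    """Transformer branch that models long-range correlation over patch tokens."""
    def __init__(self, in_ch=3, dim=64, depth=2, heads=4, patch=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        feat = self.patch_embed(x)                # (B, dim, H/4, W/4)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)  # (B, HW, dim) token sequence
        tokens = self.encoder(tokens)             # global self-attention
        return tokens, (h, w)


class Interaction(nn.Module):
    """Set-to-set fusion: Transformer tokens attend to convolutional features."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, trans_tokens, conv_feat):
        conv_tokens = conv_feat.flatten(2).transpose(1, 2)  # (B, HW, dim)
        fused, _ = self.cross_attn(trans_tokens, conv_tokens, conv_tokens)
        return self.norm(trans_tokens + fused)


class ParallelEncoder(nn.Module):
    """Parallel Transformer + convolution encoder with feature interaction."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv_branch = ConvBranch(dim=dim)
        self.trans_branch = TransformerBranch(dim=dim)
        self.interact = Interaction(dim=dim)

    def forward(self, x):
        conv_feat = self.conv_branch(x)
        trans_tokens, (h, w) = self.trans_branch(x)
        fused = self.interact(trans_tokens, conv_feat)
        # Reshape fused tokens back to a spatial map for a depth decoder head.
        return fused.transpose(1, 2).reshape(x.shape[0], -1, h, w)


if __name__ == "__main__":
    model = ParallelEncoder()
    out = model(torch.randn(1, 3, 64, 64))
    print(out.shape)  # torch.Size([1, 64, 16, 16])
```

In this toy version both branches downsample by a factor of 4 so their feature maps align; the paper's full model instead aggregates multi-scale features hierarchically and replaces the dense cross-attention with a deformable scheme to avoid the quadratic memory cost on high-resolution maps.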