DepthFormer: Exploiting Long-range Correlation and Local Information for Accurate Monocular Depth Estimation

Zhenyu Li, Zehui Chen, Xianming Liu, Junjun Jiang

Journal: Machine Intelligence Research (IF 6.4 · CAS Tier 4, Computer Science · JCR Q1, Automation & Control Systems)
DOI: 10.1007/s11633-023-1458-0
Published: 2023-09-13
Citations: 63

Abstract

This paper addresses the problem of supervised monocular depth estimation. We start with a careful pilot study demonstrating that long-range correlation is essential for accurate depth estimation, and that the Transformer and convolution are good at long-range and close-range depth estimation, respectively. We therefore adopt a parallel encoder architecture consisting of a Transformer branch and a convolution branch. The former models global context with an effective attention mechanism, while the latter preserves local information, since the Transformer lacks the spatial inductive bias needed to model such content. However, independent branches leave the two sets of features poorly connected. To bridge this gap, we design a hierarchical aggregation and heterogeneous interaction module that enhances the Transformer features and models the affinity between the heterogeneous features in a set-to-set translation manner. Because global attention on high-resolution feature maps incurs an unaffordable memory cost, we adopt a deformable scheme to reduce the complexity. Extensive experiments on the KITTI, NYU, and SUN RGB-D datasets demonstrate that our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods by prominent margins. The effectiveness of each proposed module is evaluated through thorough ablation studies.
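The abstract outlines the core architectural idea: a convolution branch that preserves local detail runs in parallel with a Transformer branch that models long-range context, and an interaction module fuses the two heterogeneous feature sets. The PyTorch sketch below illustrates that parallel-encoder pattern under simplifying assumptions; the module names, dimensions, and the plain cross-attention used for fusion are illustrative stand-ins, not the authors' DepthFormer implementation (which uses hierarchical aggregation and deformable attention to keep the cost manageable).

```python
# Minimal sketch of a parallel Transformer + convolution encoder with a
# set-to-set interaction step, as described in the abstract. All names,
# sizes, and the cross-attention fusion are illustrative assumptions.
import torch
import torch.nn as nn


class ConvBranch(nn.Module):
    """Lightweight convolutional branch that preserves local spatial detail."""
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=1),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.stem(x)  # (B, dim, H/4, W/4)


class TransformerBranch(nn.Module):
    """Transformer branch that models long-range correlation over patch tokens."""
    def __init__(self, in_ch=3, dim=64, depth=2, heads=4, patch=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        feat = self.patch_embed(x)                # (B, dim, H/4, W/4)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)  # (B, HW, dim) token sequence
        tokens = self.encoder(tokens)             # global self-attention
        return tokens, (h, w)


class Interaction(nn.Module):
    """Set-to-set fusion: Transformer tokens attend to convolutional features."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, trans_tokens, conv_feat):
        conv_tokens = conv_feat.flatten(2).transpose(1, 2)  # (B, HW, dim)
        fused, _ = self.cross_attn(trans_tokens, conv_tokens, conv_tokens)
        return self.norm(trans_tokens + fused)


class ParallelEncoder(nn.Module):
    """Parallel Transformer + convolution encoder with feature interaction."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv_branch = ConvBranch(dim=dim)
        self.trans_branch = TransformerBranch(dim=dim)
        self.interact = Interaction(dim=dim)

    def forward(self, x):
        conv_feat = self.conv_branch(x)
        trans_tokens, (h, w) = self.trans_branch(x)
        fused = self.interact(trans_tokens, conv_feat)
        # Reshape fused tokens back to a spatial map for a depth decoder head.
        return fused.transpose(1, 2).reshape(x.shape[0], -1, h, w)


if __name__ == "__main__":
    model = ParallelEncoder()
    out = model(torch.randn(1, 3, 64, 64))
    print(out.shape)  # torch.Size([1, 64, 16, 16])
```

In this toy version both branches downsample by a factor of 4 so their feature maps align; the paper's full model instead aggregates multi-scale features hierarchically and replaces the dense cross-attention with a deformable scheme to avoid the quadratic memory cost on high-resolution maps.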