SDformer: Efficient End-to-End Transformer for Depth Completion

Jian Qian, Miao Sun, Ashley Lee, Jie Li, Shenglong Zhuo, Patrick Yin Chiang
{"title":"SDformer: Efficient End-to-End Transformer for Depth Completion","authors":"Jian Qian, Miao Sun, Ashley Lee, Jie Li, Shenglong Zhuo, Patrick Yin Chiang","doi":"arxiv-2409.08159","DOIUrl":null,"url":null,"abstract":"Depth completion aims to predict dense depth maps with sparse depth\nmeasurements from a depth sensor. Currently, Convolutional Neural Network (CNN)\nbased models are the most popular methods applied to depth completion tasks.\nHowever, despite the excellent high-end performance, they suffer from a limited\nrepresentation area. To overcome the drawbacks of CNNs, a more effective and\npowerful method has been presented: the Transformer, which is an adaptive\nself-attention setting sequence-to-sequence model. While the standard\nTransformer quadratically increases the computational cost from the key-query\ndot-product of input resolution which improperly employs depth completion\ntasks. In this work, we propose a different window-based Transformer\narchitecture for depth completion tasks named Sparse-to-Dense Transformer\n(SDformer). The network consists of an input module for the depth map and RGB\nimage features extraction and concatenation, a U-shaped encoder-decoder\nTransformer for extracting deep features, and a refinement module.\nSpecifically, we first concatenate the depth map features with the RGB image\nfeatures through the input model. Then, instead of calculating self-attention\nwith the whole feature maps, we apply different window sizes to extract the\nlong-range depth dependencies. Finally, we refine the predicted features from\nthe input module and the U-shaped encoder-decoder Transformer module to get the\nenriching depth features and employ a convolution layer to obtain the dense\ndepth map. In practice, the SDformer obtains state-of-the-art results against\nthe CNN-based depth completion models with lower computing loads and parameters\non the NYU Depth V2 and KITTI DC datasets.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"60 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.08159","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Depth completion aims to predict dense depth maps from sparse depth measurements produced by a depth sensor. Currently, Convolutional Neural Network (CNN) based models are the most popular methods applied to depth completion tasks. However, despite their strong performance, they suffer from a limited receptive field. To overcome this drawback of CNNs, a more effective and powerful alternative has emerged: the Transformer, a sequence-to-sequence model built on adaptive self-attention. However, the computational cost of the standard Transformer grows quadratically with input resolution because of the key-query dot-product, which makes it ill-suited to depth completion tasks. In this work, we propose a window-based Transformer architecture for depth completion named the Sparse-to-Dense Transformer (SDformer). The network consists of an input module that extracts and concatenates depth map and RGB image features, a U-shaped encoder-decoder Transformer for extracting deep features, and a refinement module. Specifically, we first concatenate the depth map features with the RGB image features through the input module. Then, instead of computing self-attention over the whole feature maps, we apply different window sizes to capture long-range depth dependencies. Finally, we refine the predicted features from the input module and the U-shaped encoder-decoder Transformer module to obtain enriched depth features, and employ a convolution layer to produce the dense depth map. In practice, SDformer achieves state-of-the-art results against CNN-based depth completion models with lower computational load and fewer parameters on the NYU Depth V2 and KITTI DC datasets.
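The core efficiency idea in the abstract is restricting self-attention to local windows so that the quadratic cost scales with the window size rather than the full feature-map resolution. The sketch below illustrates that idea only; the module and parameter names (WindowSelfAttention, window_size, num_heads) and the use of PyTorch's nn.MultiheadAttention are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class WindowSelfAttention(nn.Module):
    """Self-attention computed inside non-overlapping windows of a feature map.

    Illustrative sketch: attention cost is quadratic in window_size**2 tokens,
    not in H*W, which is the motivation stated in the abstract.
    """

    def __init__(self, dim: int, window_size: int, num_heads: int):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); H and W are assumed divisible by window_size
        b, c, h, w = x.shape
        ws = self.window_size
        # Partition into non-overlapping ws x ws windows -> (B * nWindows, ws*ws, C)
        x = x.view(b, c, h // ws, ws, w // ws, ws)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, c)
        # Self-attention restricted to the tokens of each window
        x, _ = self.attn(x, x, x)
        # Reverse the window partition back to (B, C, H, W)
        x = x.reshape(b, h // ws, w // ws, ws, ws, c)
        x = x.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)
        return x


if __name__ == "__main__":
    # Toy check on fused depth + RGB features; two window sizes mimic the
    # "different window sizes" used to capture longer-range depth dependencies.
    feats = torch.randn(2, 64, 32, 32)
    for ws in (4, 8):
        out = WindowSelfAttention(dim=64, window_size=ws, num_heads=4)(feats)
        print(ws, out.shape)  # torch.Size([2, 64, 32, 32])
```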