Jian Qian, Miao Sun, Ashley Lee, Jie Li, Shenglong Zhuo, Patrick Yin Chiang
{"title":"SDformer:用于深度补全的高效端到端变换器","authors":"Jian Qian, Miao Sun, Ashley Lee, Jie Li, Shenglong Zhuo, Patrick Yin Chiang","doi":"arxiv-2409.08159","DOIUrl":null,"url":null,"abstract":"Depth completion aims to predict dense depth maps with sparse depth\nmeasurements from a depth sensor. Currently, Convolutional Neural Network (CNN)\nbased models are the most popular methods applied to depth completion tasks.\nHowever, despite the excellent high-end performance, they suffer from a limited\nrepresentation area. To overcome the drawbacks of CNNs, a more effective and\npowerful method has been presented: the Transformer, which is an adaptive\nself-attention setting sequence-to-sequence model. While the standard\nTransformer quadratically increases the computational cost from the key-query\ndot-product of input resolution which improperly employs depth completion\ntasks. In this work, we propose a different window-based Transformer\narchitecture for depth completion tasks named Sparse-to-Dense Transformer\n(SDformer). The network consists of an input module for the depth map and RGB\nimage features extraction and concatenation, a U-shaped encoder-decoder\nTransformer for extracting deep features, and a refinement module.\nSpecifically, we first concatenate the depth map features with the RGB image\nfeatures through the input model. Then, instead of calculating self-attention\nwith the whole feature maps, we apply different window sizes to extract the\nlong-range depth dependencies. Finally, we refine the predicted features from\nthe input module and the U-shaped encoder-decoder Transformer module to get the\nenriching depth features and employ a convolution layer to obtain the dense\ndepth map. In practice, the SDformer obtains state-of-the-art results against\nthe CNN-based depth completion models with lower computing loads and parameters\non the NYU Depth V2 and KITTI DC datasets.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"60 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SDformer: Efficient End-to-End Transformer for Depth Completion\",\"authors\":\"Jian Qian, Miao Sun, Ashley Lee, Jie Li, Shenglong Zhuo, Patrick Yin Chiang\",\"doi\":\"arxiv-2409.08159\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Depth completion aims to predict dense depth maps with sparse depth\\nmeasurements from a depth sensor. Currently, Convolutional Neural Network (CNN)\\nbased models are the most popular methods applied to depth completion tasks.\\nHowever, despite the excellent high-end performance, they suffer from a limited\\nrepresentation area. To overcome the drawbacks of CNNs, a more effective and\\npowerful method has been presented: the Transformer, which is an adaptive\\nself-attention setting sequence-to-sequence model. While the standard\\nTransformer quadratically increases the computational cost from the key-query\\ndot-product of input resolution which improperly employs depth completion\\ntasks. In this work, we propose a different window-based Transformer\\narchitecture for depth completion tasks named Sparse-to-Dense Transformer\\n(SDformer). 
The network consists of an input module for the depth map and RGB\\nimage features extraction and concatenation, a U-shaped encoder-decoder\\nTransformer for extracting deep features, and a refinement module.\\nSpecifically, we first concatenate the depth map features with the RGB image\\nfeatures through the input model. Then, instead of calculating self-attention\\nwith the whole feature maps, we apply different window sizes to extract the\\nlong-range depth dependencies. Finally, we refine the predicted features from\\nthe input module and the U-shaped encoder-decoder Transformer module to get the\\nenriching depth features and employ a convolution layer to obtain the dense\\ndepth map. In practice, the SDformer obtains state-of-the-art results against\\nthe CNN-based depth completion models with lower computing loads and parameters\\non the NYU Depth V2 and KITTI DC datasets.\",\"PeriodicalId\":501130,\"journal\":{\"name\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"volume\":\"60 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.08159\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.08159","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
SDformer: Efficient End-to-End Transformer for Depth Completion
Depth completion aims to predict dense depth maps from sparse depth measurements produced by a depth sensor. Currently, models based on Convolutional Neural Networks (CNNs) are the most popular methods for depth completion tasks. However, despite their excellent performance, they suffer from a limited representation area. To overcome this drawback of CNNs, a more effective and powerful method has been presented: the Transformer, a sequence-to-sequence model built on adaptive self-attention. However, the computational cost of the standard Transformer grows quadratically with the input resolution because of the key-query dot product, which makes it ill-suited to depth completion tasks. In this work, we propose a window-based Transformer architecture for depth completion, named the Sparse-to-Dense Transformer (SDformer). The network consists of an input module that extracts and concatenates depth-map and RGB-image features, a U-shaped encoder-decoder Transformer that extracts deep features, and a refinement module. Specifically, we first concatenate the depth-map features with the RGB-image features through the input module. Then, instead of computing self-attention over the whole feature maps, we apply different window sizes to extract long-range depth dependencies. Finally, we refine the predicted features from the input module and the U-shaped encoder-decoder Transformer module to obtain enriched depth features, and employ a convolution layer to produce the dense depth map. In practice, SDformer achieves state-of-the-art results against CNN-based depth completion models, with lower computational load and fewer parameters, on the NYU Depth V2 and KITTI DC datasets.
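
The efficiency idea in the abstract is to restrict self-attention to local windows so the cost no longer grows quadratically with the full H x W resolution. The snippet below is a minimal, hypothetical PyTorch sketch of that general technique, not the authors' released code; the class name, head count, and layer choices are assumptions. It partitions the feature map into non-overlapping windows, computes attention inside each window, and merges the windows back.

```python
import torch
import torch.nn as nn


class WindowSelfAttention(nn.Module):
    """Self-attention restricted to non-overlapping spatial windows (illustrative sketch)."""

    def __init__(self, dim: int, window_size: int, num_heads: int = 4):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); H and W are assumed divisible by window_size.
        b, c, h, w = x.shape
        s = self.window_size
        # Partition the feature map into (B * num_windows, s*s, C) token sequences.
        x = x.view(b, c, h // s, s, w // s, s)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, s * s, c)
        # Attention is computed only among the s*s tokens of each window,
        # so the cost scales with the window area rather than with H * W.
        out, _ = self.attn(x, x, x)
        # Merge the windows back into a (B, C, H, W) feature map.
        out = out.view(b, h // s, w // s, s, s, c)
        out = out.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)
        return out


# Example: 8x8 windows over a (1, 64, 32, 32) feature map.
y = WindowSelfAttention(dim=64, window_size=8)(torch.rand(1, 64, 32, 32))
print(y.shape)  # torch.Size([1, 64, 32, 32])
```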
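For the overall pipeline the abstract describes (an input module fusing sparse-depth and RGB features, a U-shaped Transformer body, and a convolutional refinement stage), the following is a similarly hedged, high-level sketch. It reuses the WindowSelfAttention class from the previous snippet, and a flat stack of window-attention blocks stands in for the U-shaped encoder-decoder; all module names, channel counts, and window sizes are illustrative assumptions rather than the actual SDformer configuration.

```python
import torch
import torch.nn as nn


class DepthCompletionSketch(nn.Module):
    """Toy pipeline: feature fusion -> windowed Transformer body -> refinement."""

    def __init__(self, feat_dim: int = 64, window_size: int = 8):
        super().__init__()
        # Input module: shallow convs on sparse depth and RGB, then channel concatenation.
        self.depth_branch = nn.Conv2d(1, feat_dim // 2, 3, padding=1)
        self.rgb_branch = nn.Conv2d(3, feat_dim // 2, 3, padding=1)
        # Stand-in for the U-shaped encoder-decoder: window-attention blocks
        # with different window sizes to capture longer-range dependencies.
        self.body = nn.Sequential(
            WindowSelfAttention(feat_dim, window_size),
            WindowSelfAttention(feat_dim, window_size * 2),
        )
        # Refinement: fuse input-module and body features, predict dense depth.
        self.refine = nn.Conv2d(feat_dim * 2, 1, 3, padding=1)

    def forward(self, sparse_depth: torch.Tensor, rgb: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([self.depth_branch(sparse_depth),
                           self.rgb_branch(rgb)], dim=1)
        deep = self.body(feats)
        return self.refine(torch.cat([feats, deep], dim=1))


# Example on a 256x256 crop: sparse depth (1 channel) plus RGB (3 channels)
# produce a dense depth map at the same resolution.
dense = DepthCompletionSketch()(torch.rand(1, 1, 256, 256), torch.rand(1, 3, 256, 256))
print(dense.shape)  # torch.Size([1, 1, 256, 256])
```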