Jian Qian, Miao Sun, Ashley Lee, Jie Li, Shenglong Zhuo, Patrick Yin Chiang
{"title":"SDformer:用于深度补全的高效端到端变换器","authors":"Jian Qian, Miao Sun, Ashley Lee, Jie Li, Shenglong Zhuo, Patrick Yin Chiang","doi":"arxiv-2409.08159","DOIUrl":null,"url":null,"abstract":"Depth completion aims to predict dense depth maps with sparse depth\nmeasurements from a depth sensor. Currently, Convolutional Neural Network (CNN)\nbased models are the most popular methods applied to depth completion tasks.\nHowever, despite the excellent high-end performance, they suffer from a limited\nrepresentation area. To overcome the drawbacks of CNNs, a more effective and\npowerful method has been presented: the Transformer, which is an adaptive\nself-attention setting sequence-to-sequence model. While the standard\nTransformer quadratically increases the computational cost from the key-query\ndot-product of input resolution which improperly employs depth completion\ntasks. In this work, we propose a different window-based Transformer\narchitecture for depth completion tasks named Sparse-to-Dense Transformer\n(SDformer). The network consists of an input module for the depth map and RGB\nimage features extraction and concatenation, a U-shaped encoder-decoder\nTransformer for extracting deep features, and a refinement module.\nSpecifically, we first concatenate the depth map features with the RGB image\nfeatures through the input model. Then, instead of calculating self-attention\nwith the whole feature maps, we apply different window sizes to extract the\nlong-range depth dependencies. Finally, we refine the predicted features from\nthe input module and the U-shaped encoder-decoder Transformer module to get the\nenriching depth features and employ a convolution layer to obtain the dense\ndepth map. In practice, the SDformer obtains state-of-the-art results against\nthe CNN-based depth completion models with lower computing loads and parameters\non the NYU Depth V2 and KITTI DC datasets.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"60 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SDformer: Efficient End-to-End Transformer for Depth Completion\",\"authors\":\"Jian Qian, Miao Sun, Ashley Lee, Jie Li, Shenglong Zhuo, Patrick Yin Chiang\",\"doi\":\"arxiv-2409.08159\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Depth completion aims to predict dense depth maps with sparse depth\\nmeasurements from a depth sensor. Currently, Convolutional Neural Network (CNN)\\nbased models are the most popular methods applied to depth completion tasks.\\nHowever, despite the excellent high-end performance, they suffer from a limited\\nrepresentation area. To overcome the drawbacks of CNNs, a more effective and\\npowerful method has been presented: the Transformer, which is an adaptive\\nself-attention setting sequence-to-sequence model. While the standard\\nTransformer quadratically increases the computational cost from the key-query\\ndot-product of input resolution which improperly employs depth completion\\ntasks. In this work, we propose a different window-based Transformer\\narchitecture for depth completion tasks named Sparse-to-Dense Transformer\\n(SDformer). 
The network consists of an input module for the depth map and RGB\\nimage features extraction and concatenation, a U-shaped encoder-decoder\\nTransformer for extracting deep features, and a refinement module.\\nSpecifically, we first concatenate the depth map features with the RGB image\\nfeatures through the input model. Then, instead of calculating self-attention\\nwith the whole feature maps, we apply different window sizes to extract the\\nlong-range depth dependencies. Finally, we refine the predicted features from\\nthe input module and the U-shaped encoder-decoder Transformer module to get the\\nenriching depth features and employ a convolution layer to obtain the dense\\ndepth map. In practice, the SDformer obtains state-of-the-art results against\\nthe CNN-based depth completion models with lower computing loads and parameters\\non the NYU Depth V2 and KITTI DC datasets.\",\"PeriodicalId\":501130,\"journal\":{\"name\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"volume\":\"60 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.08159\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.08159","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
SDformer: Efficient End-to-End Transformer for Depth Completion
Depth completion aims to predict dense depth maps from sparse depth measurements produced by a depth sensor. Currently, models based on Convolutional Neural Networks (CNNs) are the most popular methods for depth completion tasks. However, despite their excellent performance, they suffer from a limited representation area. To overcome this drawback of CNNs, a more effective and powerful method has been presented: the Transformer, a sequence-to-sequence model built on adaptive self-attention. However, the computational cost of the standard Transformer grows quadratically with the input resolution because of the key-query dot product, which makes it ill-suited to depth completion tasks. In this work, we propose a window-based Transformer architecture for depth completion, named the Sparse-to-Dense Transformer (SDformer). The network consists of an input module that extracts and concatenates depth-map and RGB-image features, a U-shaped encoder-decoder Transformer that extracts deep features, and a refinement module. Specifically, we first concatenate the depth-map features with the RGB-image features through the input module. Then, instead of computing self-attention over the whole feature maps, we apply different window sizes to extract long-range depth dependencies. Finally, we refine the predicted features from the input module and the U-shaped encoder-decoder Transformer module to obtain enriched depth features, and employ a convolution layer to produce the dense depth map. In practice, SDformer achieves state-of-the-art results against CNN-based depth completion models, with lower computational load and fewer parameters, on the NYU Depth V2 and KITTI DC datasets.
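
The efficiency idea in the abstract is to restrict self-attention to local windows so the cost no longer grows quadratically with the full H x W resolution. The snippet below is a minimal, hypothetical PyTorch sketch of that general technique, not the authors' released code; the class name, head count, and layer choices are assumptions. It partitions the feature map into non-overlapping windows, computes attention inside each window, and merges the windows back.

```python
import torch
import torch.nn as nn


class WindowSelfAttention(nn.Module):
    """Self-attention restricted to non-overlapping spatial windows (illustrative sketch)."""

    def __init__(self, dim: int, window_size: int, num_heads: int = 4):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); H and W are assumed divisible by window_size.
        b, c, h, w = x.shape
        s = self.window_size
        # Partition the feature map into (B * num_windows, s*s, C) token sequences.
        x = x.view(b, c, h // s, s, w // s, s)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, s * s, c)
        # Attention is computed only among the s*s tokens of each window,
        # so the cost scales with the window area rather than with H * W.
        out, _ = self.attn(x, x, x)
        # Merge the windows back into a (B, C, H, W) feature map.
        out = out.view(b, h // s, w // s, s, s, c)
        out = out.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)
        return out


# Example: 8x8 windows over a (1, 64, 32, 32) feature map.
y = WindowSelfAttention(dim=64, window_size=8)(torch.rand(1, 64, 32, 32))
print(y.shape)  # torch.Size([1, 64, 32, 32])
```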
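For the overall pipeline the abstract describes (an input module fusing sparse-depth and RGB features, a U-shaped Transformer body, and a convolutional refinement stage), the following is a similarly hedged, high-level sketch. It reuses the WindowSelfAttention class from the previous snippet, and a flat stack of window-attention blocks stands in for the U-shaped encoder-decoder; all module names, channel counts, and window sizes are illustrative assumptions rather than the actual SDformer configuration.

```python
import torch
import torch.nn as nn


class DepthCompletionSketch(nn.Module):
    """Toy pipeline: feature fusion -> windowed Transformer body -> refinement."""

    def __init__(self, feat_dim: int = 64, window_size: int = 8):
        super().__init__()
        # Input module: shallow convs on sparse depth and RGB, then channel concatenation.
        self.depth_branch = nn.Conv2d(1, feat_dim // 2, 3, padding=1)
        self.rgb_branch = nn.Conv2d(3, feat_dim // 2, 3, padding=1)
        # Stand-in for the U-shaped encoder-decoder: window-attention blocks
        # with different window sizes to capture longer-range dependencies.
        self.body = nn.Sequential(
            WindowSelfAttention(feat_dim, window_size),
            WindowSelfAttention(feat_dim, window_size * 2),
        )
        # Refinement: fuse input-module and body features, predict dense depth.
        self.refine = nn.Conv2d(feat_dim * 2, 1, 3, padding=1)

    def forward(self, sparse_depth: torch.Tensor, rgb: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([self.depth_branch(sparse_depth),
                           self.rgb_branch(rgb)], dim=1)
        deep = self.body(feats)
        return self.refine(torch.cat([feats, deep], dim=1))


# Example on a 256x256 crop: sparse depth (1 channel) plus RGB (3 channels)
# produce a dense depth map at the same resolution.
dense = DepthCompletionSketch()(torch.rand(1, 1, 256, 256), torch.rand(1, 3, 256, 256))
print(dense.shape)  # torch.Size([1, 1, 256, 256])
```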