SDformer: Efficient End-to-End Transformer for Depth Completion
Jian Qian, Miao Sun, Ashley Lee, Jie Li, Shenglong Zhuo, Patrick Yin Chiang
arXiv:2409.08159 [cs.CV], published 2024-09-12
Abstract
Depth completion aims to predict dense depth maps from the sparse depth measurements produced by a depth sensor. Convolutional Neural Network (CNN) based models are currently the most popular approach to depth completion; however, despite their excellent performance, they are constrained by the limited representation area of convolutions. To overcome this drawback of CNNs, a more effective and powerful alternative has emerged: the Transformer, a sequence-to-sequence model built on adaptive self-attention. Yet the computational cost of the standard Transformer grows quadratically with input resolution because of the key-query dot product, which makes it ill-suited to depth completion tasks. In this work, we propose a window-based Transformer architecture for depth completion named the Sparse-to-Dense Transformer (SDformer). The network consists of an input module that extracts and concatenates depth map and RGB image features, a U-shaped encoder-decoder Transformer that extracts deep features, and a refinement module. Specifically, we first concatenate the depth map features with the RGB image features through the input module. Then, instead of computing self-attention over the whole feature map, we apply windows of different sizes to capture long-range depth dependencies. Finally, we refine the predicted features from the input module and the U-shaped encoder-decoder Transformer to obtain enriched depth features, and apply a convolution layer to produce the dense depth map. In practice, SDformer achieves state-of-the-art results against CNN-based depth completion models with lower computational load and fewer parameters on the NYU Depth V2 and KITTI DC datasets.
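
The efficiency argument in the abstract hinges on computing self-attention inside local windows rather than over the full feature map, so the cost scales with the number of windows instead of quadratically with resolution. The PyTorch sketch below illustrates that general idea; the class name, layer layout, and tensor conventions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of window-based multi-head self-attention (Swin-style).
# The SDformer abstract describes attention within windows of different
# sizes; names and shapes here are placeholders for illustration only.
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    def __init__(self, dim, window_size, num_heads):
        super().__init__()
        self.window_size = window_size
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C); H and W are assumed divisible by the window size.
        B, H, W, C = x.shape
        ws = self.window_size
        # Partition the feature map into non-overlapping ws x ws windows.
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

        # Scaled dot-product attention, restricted to each window, so the
        # key-query product is (ws*ws)^2 per window rather than (H*W)^2.
        qkv = self.qkv(windows).reshape(-1, ws * ws, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(-1, ws * ws, C)
        out = self.proj(out)

        # Reverse the window partition back to (B, H, W, C).
        out = out.view(B, H // ws, W // ws, ws, ws, C)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
```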
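For the overall pipeline the abstract outlines (input module that fuses depth and RGB features, a U-shaped window-Transformer body, a refinement module, and a final convolution), a hypothetical end-to-end sketch might look as follows. It reuses the WindowAttention block above and flattens the U-shape into a plain residual stack; all module names, channel counts, and window sizes are assumptions rather than the paper's configuration.

```python
# Hypothetical sparse-to-dense pipeline sketch; not the authors' SDformer code.
class SparseToDenseNet(nn.Module):
    def __init__(self, feat_dim=32, window_sizes=(4, 8)):
        super().__init__()
        # Input module: separate stems for the sparse depth map and RGB image,
        # followed by concatenation and fusion.
        self.depth_stem = nn.Conv2d(1, feat_dim, 3, padding=1)
        self.rgb_stem = nn.Conv2d(3, feat_dim, 3, padding=1)
        self.fuse = nn.Conv2d(2 * feat_dim, feat_dim, 3, padding=1)
        # Stand-in for the U-shaped encoder-decoder: a stack of window-attention
        # blocks with different window sizes to capture longer-range dependencies.
        self.body = nn.ModuleList(
            [WindowAttention(feat_dim, ws, num_heads=4) for ws in window_sizes]
        )
        # Refinement module and final projection to a one-channel dense depth map.
        self.refine = nn.Conv2d(feat_dim, feat_dim, 3, padding=1)
        self.head = nn.Conv2d(feat_dim, 1, 3, padding=1)

    def forward(self, sparse_depth, rgb):
        # Input module: extract and concatenate depth and RGB features.
        feat = self.fuse(torch.cat([self.depth_stem(sparse_depth),
                                    self.rgb_stem(rgb)], dim=1))
        skip = feat
        # Window-based Transformer body operates on (B, H, W, C) tensors.
        x = feat.permute(0, 2, 3, 1)
        for block in self.body:
            x = x + block(x)              # residual window attention
        x = x.permute(0, 3, 1, 2)
        # Refine the Transformer output together with the input-module features,
        # then predict the dense depth map with a convolution layer.
        x = self.refine(x + skip)
        return self.head(x)

# Usage: dense depth from a 1-channel sparse depth map and a 3-channel RGB image.
dense = SparseToDenseNet()(torch.zeros(1, 1, 64, 64), torch.zeros(1, 3, 64, 64))
```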