Self-supervised monocular depth estimation with self-distillation and dense skip connection
Xuezhi Xiang, Wei Li, Yao Wang, Abdulmotaleb El Saddik
Computer Vision and Image Understanding (Impact Factor 4.3, JCR Q2: Computer Science, Artificial Intelligence), published 2024-06-01
DOI: 10.1016/j.cviu.2024.104048
https://www.sciencedirect.com/science/article/pii/S1077314224001292
Abstract
Monocular depth estimation (MDE) is crucial in a wide range of applications, including robotics, autonomous driving, and virtual reality. Self-supervised monocular depth estimation has emerged as a promising MDE approach because it does not require hard-to-obtain depth labels during training, and the multi-scale photometric loss is widely used as its self-supervised signal. However, the multi-scale photometric loss is a weak training signal and might disturb good intermediate feature representations. In this paper, we propose a successive depth map self-distillation (SDM-SD) loss, which is combined with a single-scale photometric loss to replace the multi-scale photometric loss. Moreover, considering that multi-stage feature representations are essential for dense prediction tasks such as depth estimation, we also propose a dense skip connection, which efficiently fuses the intermediate features of the encoder and fully utilizes them in each stage of the decoder of our encoder–decoder architecture. With the successive depth map self-distillation loss and the dense skip connection, our proposed method achieves state-of-the-art performance on the KITTI benchmark and exhibits the best generalization ability on the challenging indoor NYUv2 dataset.
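The abstract does not give the exact form of the SDM-SD loss, so the following is only an illustrative sketch of one plausible reading: each coarser depth map is upsampled and compared with the next finer depth map, which acts as the teacher. The function names, the nearest-neighbour upsampling, the factor-of-two scale pyramid, and the mean-absolute-difference penalty are all assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical sketch: the abstract does not specify the SDM-SD formulation.
# Assumption: each coarser depth map (student) is upsampled to the next finer
# resolution and penalised by its mean absolute difference to that finer map
# (teacher). All names below are illustrative.

def upsample_nearest(depth, factor=2):
    """Nearest-neighbour upsampling of a 2-D list by an integer factor."""
    out = []
    for row in depth:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def mean_abs_diff(a, b):
    """Mean absolute difference between two equally sized 2-D lists."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += abs(x - y)
            count += 1
    return total / count


def successive_distillation_loss(depth_maps):
    """Sum of student-vs-teacher differences over successive scale pairs.

    depth_maps is ordered fine to coarse, each half the size of the previous.
    In a real training loop the teacher would be detached from the gradient.
    """
    loss = 0.0
    for teacher, student in zip(depth_maps, depth_maps[1:]):
        loss += mean_abs_diff(upsample_nearest(student), teacher)
    return loss


# Toy three-level pyramid of predicted depth maps (values are made up).
fine = [[1.0, 1.0, 2.0, 2.0],
        [1.0, 1.0, 2.0, 2.0],
        [3.0, 3.0, 4.0, 4.0],
        [3.0, 3.0, 4.0, 4.0]]
mid = [[1.0, 2.0],
       [3.0, 5.0]]   # bottom-right cell disagrees with `fine` by 1.0
coarse = [[2.75]]    # mean of `mid`

sdm_sd = successive_distillation_loss([fine, mid, coarse])
```

Under this reading, only the finest depth map would receive the single-scale photometric loss, and the term above would be added to it in place of the multi-scale photometric loss. How the two terms are weighted is not stated in the abstract.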
About the journal:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems