DepthNet: A Recurrent Neural Network Architecture for Monocular Depth Prediction

Arun C. S. Kumar, S. Bhandarkar, Mukta Prasad
{"title":"DepthNet: A Recurrent Neural Network Architecture for Monocular Depth Prediction","authors":"Arun C. S. Kumar, S. Bhandarkar, Mukta Prasad","doi":"10.1109/CVPRW.2018.00066","DOIUrl":null,"url":null,"abstract":"Predicting the depth map of a scene is often a vital component of monocular SLAM pipelines. Depth prediction is fundamentally ill-posed due to the inherent ambiguity in the scene formation process. In recent times, convolutional neural networks (CNNs) that exploit scene geometric constraints have been explored extensively for supervised single-view depth prediction and semi-supervised 2-view depth prediction. In this paper we explore whether recurrent neural networks (RNNs) can learn spatio-temporally accurate monocular depth prediction from video sequences, even without explicit definition of the inter-frame geometric consistency or pose supervision. To this end, we propose a novel convolutional LSTM (ConvLSTM)-based network architecture for depth prediction from a monocular video sequence. In the proposed ConvLSTM network architecture, we harness the ability of long short-term memory (LSTM)-based RNNs to reason sequentially and predict the depth map for an image frame as a function of the appearances of scene objects in the image frame as well as image frames in its temporal neighborhood. In addition, the proposed ConvLSTM network is also shown to be able to make depth predictions for future or unseen image frame(s). We demonstrate the depth prediction performance of the proposed ConvLSTM network on the KITTI dataset and show that it gives results that are superior in terms of accuracy to those obtained via depth-supervised and self-supervised methods and comparable to those generated by state-of-the-art pose-supervised methods.","PeriodicalId":150600,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"75","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CVPRW.2018.00066","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 75

Abstract

Predicting the depth map of a scene is often a vital component of monocular SLAM pipelines. Depth prediction is fundamentally ill-posed due to the inherent ambiguity in the scene formation process. In recent times, convolutional neural networks (CNNs) that exploit scene geometric constraints have been explored extensively for supervised single-view depth prediction and semi-supervised 2-view depth prediction. In this paper we explore whether recurrent neural networks (RNNs) can learn spatio-temporally accurate monocular depth prediction from video sequences, even without explicit definition of the inter-frame geometric consistency or pose supervision. To this end, we propose a novel convolutional LSTM (ConvLSTM)-based network architecture for depth prediction from a monocular video sequence. In the proposed ConvLSTM network architecture, we harness the ability of long short-term memory (LSTM)-based RNNs to reason sequentially and predict the depth map for an image frame as a function of the appearances of scene objects in the image frame as well as image frames in its temporal neighborhood. In addition, the proposed ConvLSTM network is also shown to be able to make depth predictions for future or unseen image frame(s). We demonstrate the depth prediction performance of the proposed ConvLSTM network on the KITTI dataset and show that it gives results that are superior in terms of accuracy to those obtained via depth-supervised and self-supervised methods and comparable to those generated by state-of-the-art pose-supervised methods.
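To make the described architecture concrete, below is a minimal sketch of a ConvLSTM-based recurrent depth predictor in PyTorch. It is an illustration of the general technique only, not the authors' exact DepthNet model: the encoder/decoder layout, channel widths, and the `DepthRNN`/`ConvLSTMCell` names are assumptions introduced here for clarity.

```python
# Minimal sketch (assumed PyTorch implementation, not the authors' code):
# a small CNN encoder feeds a ConvLSTM cell, whose spatial hidden state is
# decoded to a per-pixel depth map for each frame of a monocular video clip.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell: all gates are computed with 2-D convolutions,
    so the hidden state and cell memory keep their spatial structure."""
    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        # One convolution produces all four gates (i, f, o, g) at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c

class DepthRNN(nn.Module):
    """Toy recurrent depth predictor: depth for frame t is a function of the
    current frame and the recurrent state carried over from earlier frames."""
    def __init__(self, hid_ch=32):
        super().__init__()
        self.hid_ch = hid_ch
        self.encoder = nn.Sequential(                       # downsample by 4
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, hid_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.cell = ConvLSTMCell(hid_ch, hid_ch)
        self.decoder = nn.Sequential(                       # back to full size
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(hid_ch, 1, 3, padding=1), nn.Softplus(),  # depth > 0
        )

    def forward(self, frames):
        # frames: (batch, time, 3, H, W) video clip
        b, t, _, h, w = frames.shape
        hs = torch.zeros(b, self.hid_ch, h // 4, w // 4, device=frames.device)
        cs = torch.zeros_like(hs)
        depths = []
        for k in range(t):
            feat = self.encoder(frames[:, k])
            hs, cs = self.cell(feat, (hs, cs))
            depths.append(self.decoder(hs))
        return torch.stack(depths, dim=1)   # (batch, time, 1, H, W)

# Example: a clip of 4 frames at a KITTI-like 128x416 resolution.
model = DepthRNN()
clip = torch.randn(1, 4, 3, 128, 416)
print(model(clip).shape)                    # torch.Size([1, 4, 1, 128, 416])
```

Because the recurrent state summarizes the temporal neighborhood, the same loop can be rolled forward past the last observed frame to produce the kind of future-frame depth predictions the abstract mentions; the training losses used in the paper are not reproduced in this sketch.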