Unsupervised Depth Estimation from Monocular Video based on Relative Motion
H. Cao, Chao Wang, Ping Wang, Qingquan Zou, Xiao Xiao
International Conference on Signal Processing and Machine Learning, 2018. DOI: 10.1145/3297067.3297094
Abstract
In this paper, we present an unsupervised learning-based approach to depth estimation from monocular video. Our system consists of two convolutional neural networks (CNNs): a Depth-net that estimates the depth of objects in the target frame, and a Pose-net that estimates the relative motion of the camera from multiple adjacent video frames. Unlike most previous works, which assume that all objects in the scene are static and therefore have the Pose-net produce a single frame-level camera pose, we account for the motion of every object and require the Pose-net to estimate a pixel-level relative pose. The outputs of the two networks are combined into a synthetic view loss function, through which both CNNs are optimized to provide accurate depth estimation. Experimental results show that our method outperforms most conventional approaches.
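The core of the approach is the synthetic view loss: the Depth-net's depth map and the Pose-net's pixel-level relative pose are used to warp an adjacent source frame into the target view, and the photometric difference between the warped frame and the target frame drives training of both networks. Below is a minimal PyTorch sketch of such a loss under stated assumptions; the helper names, the small-angle 6-DoF pose parameterisation, and all tensor shapes are illustrative and not the authors' implementation.

```python
# Minimal sketch of a synthetic-view (photometric) loss with a per-pixel
# relative pose. Shapes, the pose parameterisation, and helper names are
# assumptions for illustration, not the paper's exact implementation.
import torch
import torch.nn.functional as F

def pose_vec_to_mat(pose):
    """Convert per-pixel 6-DoF vectors (B, 6, H, W) -> (B, H, W, 4, 4).

    Uses small-angle rotations (rx, ry, rz) plus translation (tx, ty, tz);
    the linearised rotation is an assumption made for brevity.
    """
    B, _, H, W = pose.shape
    rx, ry, rz, tx, ty, tz = pose.permute(1, 0, 2, 3)   # six (B, H, W) maps
    T = torch.zeros(B, H, W, 4, 4, device=pose.device)
    T[..., 0, 0] = 1;   T[..., 0, 1] = -rz; T[..., 0, 2] = ry;  T[..., 0, 3] = tx
    T[..., 1, 0] = rz;  T[..., 1, 1] = 1;   T[..., 1, 2] = -rx; T[..., 1, 3] = ty
    T[..., 2, 0] = -ry; T[..., 2, 1] = rx;  T[..., 2, 2] = 1;   T[..., 2, 3] = tz
    T[..., 3, 3] = 1
    return T

def view_synthesis_loss(target, source, depth, pose, K):
    """L1 photometric loss between the target frame and the source frame
    warped into the target view.

    target, source: (B, 3, H, W) images
    depth:          (B, 1, H, W) Depth-net output for the target frame
    pose:           (B, 6, H, W) Pose-net per-pixel relative motion
    K:              (B, 3, 3)    camera intrinsics
    """
    B, _, H, W = target.shape
    # Homogeneous pixel grid, flattened to (B, 3, H*W)
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().view(3, -1)
    grid = grid.to(target.device).unsqueeze(0).expand(B, -1, -1)

    # Back-project target pixels to 3D camera points: D * K^{-1} p
    cam = depth.view(B, 1, -1) * (torch.inverse(K) @ grid)        # (B, 3, H*W)
    cam = torch.cat([cam, torch.ones(B, 1, H * W, device=cam.device)], 1)

    # Apply each pixel's rigid motion, then project with K
    T = pose_vec_to_mat(pose).view(B, H * W, 4, 4)
    moved = torch.einsum("bnij,bjn->bin", T, cam)[:, :3]          # (B, 3, H*W)
    proj = K @ moved
    px = (proj[:, :2] / proj[:, 2:].clamp(min=1e-6)).reshape(B, 2, H, W)

    # Normalise coordinates to [-1, 1] and sample the source frame
    px_x = 2 * px[:, 0] / (W - 1) - 1
    px_y = 2 * px[:, 1] / (H - 1) - 1
    warped = F.grid_sample(source, torch.stack([px_x, px_y], -1),
                           align_corners=True)

    return F.l1_loss(warped, target)
```

Because the pose is predicted per pixel rather than once per frame, independently moving objects can each receive their own rigid motion in the warp, which is what separates this formulation from the frame-level camera-pose methods the abstract contrasts against.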