Unsupervised Depth Estimation from Monocular Video based on Relative Motion

H. Cao, Chao Wang, Ping Wang, Qingquan Zou, Xiao Xiao

International Conference on Signal Processing and Machine Learning, 28 November 2018. DOI: 10.1145/3297067.3297094
In this paper, we present an unsupervised-learning-based approach to depth estimation from monocular camera video. Our system consists of two convolutional neural networks (CNNs): a Depth-net, which estimates the depth of objects in the target frame, and a Pose-net, which estimates the relative motion of the camera from multiple adjacent video frames. Unlike most previous works, which assume that all objects captured in the images are static and therefore let the Pose-net generate a single frame-level camera pose, we take the motion of every object into account and require the Pose-net to estimate a pixel-level relative pose. The outputs of the two networks are then combined into a synthetic view loss function, through which the two CNNs are jointly optimized to provide accurate depth estimation. Experimental results show that our method outperforms most conventional approaches.
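The abstract describes combining the Depth-net and Pose-net outputs into a synthetic view loss. The sketch below illustrates one plausible form of such a loss in PyTorch, under assumptions not stated in the abstract: the per-pixel relative pose is simplified to a per-pixel translation field, the photometric penalty is a plain L1 difference, and all names (view_synthesis_loss, pixel_pose, etc.) are hypothetical rather than the authors' implementation.

```python
# A minimal, hypothetical sketch of the synthetic-view loss described in the
# abstract. The per-pixel pose is simplified to a translation field; the
# paper's actual parameterization and loss terms may differ.
import torch
import torch.nn.functional as F

def view_synthesis_loss(target, source, depth, pixel_pose, K):
    """L1 photometric loss between the target frame and the source frame
    warped into the target view via predicted depth and per-pixel motion.

    target, source: (B, 3, H, W) adjacent video frames
    depth:          (B, 1, H, W) Depth-net output for the target frame
    pixel_pose:     (B, 3, H, W) Pose-net output, here a per-pixel
                    translation (a simplification of a full 6-DoF pose)
    K:              (B, 3, 3) camera intrinsics
    """
    B, _, H, W = target.shape
    dev = target.device

    # Back-project every target pixel to a 3-D point using its depth.
    ys, xs = torch.meshgrid(
        torch.arange(H, device=dev), torch.arange(W, device=dev), indexing="ij"
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()  # (3, H, W)
    pix = pix.unsqueeze(0).expand(B, -1, -1, -1).reshape(B, 3, -1)
    cam = torch.linalg.inv(K) @ pix * depth.reshape(B, 1, -1)        # (B, 3, HW)

    # Move each 3-D point by its own relative motion (pixel-level pose).
    cam = cam + pixel_pose.reshape(B, 3, -1)

    # Project the displaced points into the source frame.
    proj = K @ cam
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)                   # (B, 2, HW)

    # Normalize coordinates to [-1, 1] and sample the source image.
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
    warped = F.grid_sample(source, grid, align_corners=True)

    # Photometric error drives both Depth-net and Pose-net during training.
    return (warped - target).abs().mean()
```

A full implementation would also apply a per-pixel rotation component and typically adds regularizers such as depth smoothness or occlusion masking; the point of the sketch is only how depth and pixel-level pose jointly determine the warp that defines the loss.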