Self-supervised monocular depth learning from unknown cameras: Leveraging the power of raw data
Xiaofei Qin, Yongchao Zhu, Lin Wang, Xuedian Zhang, Changxiang He, Qiulei Dong
Image and Vision Computing, volume 157, Article 105505. Journal article, published 2025-03-19.
DOI: 10.1016/j.imavis.2025.105505
https://www.sciencedirect.com/science/article/pii/S0262885625000939
Impact factor: 4.2. JCR: Q2 (Computer Science, Artificial Intelligence).
Citation count: 0
Abstract
Self-supervised monocular depth estimation from wild videos with unknown camera intrinsics is a practical and challenging task in computer vision. Most existing methods employ a camera decoder and a pose decoder to estimate camera intrinsics and poses, respectively; however, their performance degrades significantly in complex scenarios with severe noise and large camera rotations. To address this problem, we propose a novel self-supervised monocular depth estimation method that can be trained on wild videos with a joint optimization strategy that simultaneously estimates camera intrinsics and poses. In the proposed method, a depth encoder learns scene depth features; taking these features as input, a Neighborhood Influence Module (NIM) predicts each pixel’s depth by fusing the depths of its neighboring pixels, which explicitly enforces depth accuracy. In addition, a knowledge distillation mechanism is introduced to learn a lightweight depth encoder from a large-scale depth encoder, balancing computational speed and accuracy. Experimental results on four public datasets demonstrate that the proposed method outperforms some state-of-the-art methods in most cases. Moreover, when the proposed method is trained on a mixed set of different datasets, its performance improves further compared with training on any single one of those datasets. Code is available at: https://github.com/ZhuYongChaoUSST/IntrLessMonoDepth.
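The abstract does not spell out the training objective, so the following is only a minimal sketch of the pinhole reprojection step that self-supervised depth methods commonly build on, not the authors' implementation. It illustrates why the camera intrinsics and the relative pose have to be estimated together: both enter the same pixel-to-pixel warp. The function name `reproject`, the parameters `fx`, `fy`, `cx`, `cy`, and all numeric values are hypothetical and chosen purely for illustration.

```python
from typing import Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]


def reproject(
    u: float, v: float, depth: float,
    fx: float, fy: float, cx: float, cy: float,
    rotation: Matrix3, translation: Sequence[float],
) -> Tuple[float, float]:
    """Warp pixel (u, v) with known depth from a target frame into a source frame.

    Assumes an ideal pinhole camera (no lens distortion) whose intrinsics
    are shared by both frames. All names here are illustrative.
    """
    # Back-project the pixel to a 3D point in the target camera frame.
    point = (
        (u - cx) / fx * depth,
        (v - cy) / fy * depth,
        depth,
    )
    # Apply the relative camera pose: rotate, then translate.
    moved = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if moved[2] <= 0:
        raise ValueError("point falls behind the source camera")
    # Project the moved point back to pixel coordinates.
    return fx * moved[0] / moved[2] + cx, fy * moved[1] / moved[2] + cy


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical intrinsics and pixel; with an identity pose the pixel maps to itself.
same = reproject(100.0, 50.0, 4.0, 500.0, 500.0, 320.0, 240.0,
                 IDENTITY, [0.0, 0.0, 0.0])

# A sideways translation tx moves the pixel horizontally by fx * tx / depth.
shifted = reproject(100.0, 50.0, 4.0, 500.0, 500.0, 320.0, 240.0,
                    IDENTITY, [0.2, 0.0, 0.0])
```

In this sketch the horizontal shift equals `fx * tx / depth`, so an error in the focal length changes the warp in the same way as an error in the translation or the depth. That coupling is one reason to optimize intrinsics and pose jointly rather than with independent decoders.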
About the journal:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.