Self-supervised monocular depth learning from unknown cameras: Leveraging the power of raw data

IF 4.2 · CAS Zone 3, Computer Science · JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing Pub Date : 2025-03-19 DOI:10.1016/j.imavis.2025.105505

Xiaofei Qin, Yongchao Zhu, Lin Wang, Xuedian Zhang, Changxiang He, Qiulei Dong

Volume 157, Article 105505. Publisher page: https://www.sciencedirect.com/science/article/pii/S0262885625000939

Citations: 0

Abstract

Self-supervised monocular depth estimation from wild videos with unknown camera intrinsics is a practical and challenging task in computer vision. Most existing methods employ a camera decoder and a pose decoder to estimate camera intrinsics and poses, respectively; however, their performance degrades significantly in complex scenarios with severe noise and large camera rotations. To address this problem, we propose a novel self-supervised monocular depth estimation method that can be trained on wild videos with a joint optimization strategy that simultaneously estimates camera intrinsics and poses. In the proposed method, a depth encoder learns scene depth features; taking these features as input, a Neighborhood Influence Module (NIM) predicts each pixel’s depth by fusing the depths of its neighboring pixels, which explicitly enforces depth accuracy. In addition, a knowledge distillation mechanism is introduced to learn a lightweight depth encoder from a large-scale depth encoder, balancing computational speed and accuracy. Experimental results on four public datasets demonstrate that the proposed method outperforms several state-of-the-art methods in most cases. Moreover, training the proposed method on a mixed set of datasets further improves its performance compared with training on any single one of those datasets. Code is available at: https://github.com/ZhuYongChaoUSST/IntrLessMonoDepth.
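The abstract describes the NIM only at a high level, so the following is a minimal sketch of the general idea of predicting a pixel's depth by fusing the depths of its neighbors, not the authors' module. It assumes a softmax-weighted average over a square window; the function name `fuse_neighbor_depths`, the `score_fn` callback (a stand-in for weights a network would predict from depth features), and the `radius` parameter are all hypothetical.

```python
import math


def fuse_neighbor_depths(depth, score_fn, radius=1):
    """Return a new depth map where each pixel is a softmax-weighted
    average of the depths inside its (2*radius+1)^2 neighborhood.

    depth:    H x W list of lists of floats.
    score_fn: hypothetical callback (dy, dx) -> float giving an
              unnormalized affinity for the neighbor at that offset.
              In a learned module these scores would come from features.
    """
    h, w = len(depth), len(depth[0])
    fused = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            scores, values = [], []
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    # Skip offsets that fall outside the image.
                    if 0 <= ny < h and 0 <= nx < w:
                        scores.append(score_fn(dy, dx))
                        values.append(depth[ny][nx])
            # Numerically stable softmax over the valid neighbors.
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            fused[y][x] = sum(wt * v for wt, v in zip(weights, values)) / total
    return fused


def gaussian_score(dy, dx, sigma=1.0):
    # Assumed affinity: closer neighbors get larger scores.
    return -(dy * dy + dx * dx) / (2.0 * sigma * sigma)


noisy = [
    [2.0, 2.0, 2.0],
    [2.0, 5.0, 2.0],  # the center pixel is an outlier
    [2.0, 2.0, 2.0],
]
smoothed = fuse_neighbor_depths(noisy, gaussian_score)
```

Because the weights are non-negative and sum to one, each fused depth stays within the range of its neighborhood, so an isolated outlier is pulled toward the surrounding values. A learned module would replace `gaussian_score` with per-pixel weights predicted from encoder features.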


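Source journal

Image and Vision Computing (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 8.50

Self-citation rate: 8.50%

Articles published: 143

Review time: 7.8 months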

Journal description: The primary aim of Image and Vision Computing is to provide an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to deepen understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.