Self-supervised monocular depth learning from unknown cameras: Leveraging the power of raw data

IF 4.2 · CAS Zone 3 (Computer Science) · JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Xiaofei Qin , Yongchao Zhu , Lin Wang , Xuedian Zhang , Changxiang He , Qiulei Dong
{"title":"来自未知摄像机的自监督单目深度学习:利用原始数据的力量","authors":"Xiaofei Qin ,&nbsp;Yongchao Zhu ,&nbsp;Lin Wang ,&nbsp;Xuedian Zhang ,&nbsp;Changxiang He ,&nbsp;Qiulei Dong","doi":"10.1016/j.imavis.2025.105505","DOIUrl":null,"url":null,"abstract":"<div><div>Self-supervised monocular depth estimation from wild videos with unknown camera intrinsics is a practical and challenging task in computer vision. Most of the existing methods in literature employed a camera decoder and a pose decoder to estimate camera intrinsics and poses respectively, however, their performances would be degraded significantly in many complex scenarios with severe noise and large camera rotations. To address this problem, we propose a novel self-supervised monocular depth estimation method, which could be trained from wild videos with a joint optimization strategy for simultaneously estimating camera intrinsics and poses. In the proposed method, a depth encoder is employed to learn scene depth features, and then by taking these features as inputs, a Neighborhood Influence Module (NIM) is designed for predicting each pixel’s depth by fusing the depths of its neighboring pixels, which could explicitly enforce the depth accuracy. In addition, a knowledge distillation mechanism is introduced to learn a lightweight depth encoder from a large-scale depth encoder, for achieving a balance between computational speed and accuracy. Experimental results on four public datasets demonstrate that the proposed method outperforms some state-of-the-art methods in most cases. Moreover, once the proposed method is trained with a mixed set of different datasets, its performance would be further boosted in comparison to the proposed method trained with each involved single dataset. Codes are available at: <span><span>https://github.com/ZhuYongChaoUSST/IntrLessMonoDepth</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"157 ","pages":"Article 105505"},"PeriodicalIF":4.2000,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Self-supervised monocular depth learning from unknown cameras: Leveraging the power of raw data\",\"authors\":\"Xiaofei Qin ,&nbsp;Yongchao Zhu ,&nbsp;Lin Wang ,&nbsp;Xuedian Zhang ,&nbsp;Changxiang He ,&nbsp;Qiulei Dong\",\"doi\":\"10.1016/j.imavis.2025.105505\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Self-supervised monocular depth estimation from wild videos with unknown camera intrinsics is a practical and challenging task in computer vision. Most of the existing methods in literature employed a camera decoder and a pose decoder to estimate camera intrinsics and poses respectively, however, their performances would be degraded significantly in many complex scenarios with severe noise and large camera rotations. To address this problem, we propose a novel self-supervised monocular depth estimation method, which could be trained from wild videos with a joint optimization strategy for simultaneously estimating camera intrinsics and poses. In the proposed method, a depth encoder is employed to learn scene depth features, and then by taking these features as inputs, a Neighborhood Influence Module (NIM) is designed for predicting each pixel’s depth by fusing the depths of its neighboring pixels, which could explicitly enforce the depth accuracy. 
In addition, a knowledge distillation mechanism is introduced to learn a lightweight depth encoder from a large-scale depth encoder, for achieving a balance between computational speed and accuracy. Experimental results on four public datasets demonstrate that the proposed method outperforms some state-of-the-art methods in most cases. Moreover, once the proposed method is trained with a mixed set of different datasets, its performance would be further boosted in comparison to the proposed method trained with each involved single dataset. Codes are available at: <span><span>https://github.com/ZhuYongChaoUSST/IntrLessMonoDepth</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":50374,\"journal\":{\"name\":\"Image and Vision Computing\",\"volume\":\"157 \",\"pages\":\"Article 105505\"},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2025-03-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Image and Vision Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0262885625000939\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625000939","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Self-supervised monocular depth estimation from wild videos with unknown camera intrinsics is a practical and challenging task in computer vision. Most existing methods in the literature employ a camera decoder and a pose decoder to estimate camera intrinsics and poses respectively; however, their performance degrades significantly in complex scenarios with severe noise and large camera rotations. To address this problem, we propose a novel self-supervised monocular depth estimation method that can be trained on wild videos with a joint optimization strategy for simultaneously estimating camera intrinsics and poses. In the proposed method, a depth encoder is employed to learn scene depth features, and, taking these features as inputs, a Neighborhood Influence Module (NIM) is designed to predict each pixel's depth by fusing the depths of its neighboring pixels, which explicitly enforces depth accuracy. In addition, a knowledge distillation mechanism is introduced to learn a lightweight depth encoder from a large-scale depth encoder, achieving a balance between computational speed and accuracy. Experimental results on four public datasets demonstrate that the proposed method outperforms some state-of-the-art methods in most cases. Moreover, once the proposed method is trained with a mixed set of different datasets, its performance is further boosted compared with training on each single dataset. Code is available at: https://github.com/ZhuYongChaoUSST/IntrLessMonoDepth.
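To make the neighborhood-fusion idea concrete, below is a minimal PyTorch sketch of how a module in the spirit of the NIM could be written: an initial per-pixel depth is re-estimated as a softmax-weighted sum of the depths of its k×k neighbors, with the weights predicted from the encoder's depth features. The class name NeighborhoodFusion, the kernel size, and the weighting scheme are illustrative assumptions, not the authors' actual design (see the linked repository for that).

import torch
import torch.nn as nn
import torch.nn.functional as F

class NeighborhoodFusion(nn.Module):
    # Illustrative neighborhood-fusion layer (hypothetical, not the paper's NIM):
    # each pixel's depth is re-estimated as a softmax-weighted combination of
    # the depths of its k x k neighbors, with weights predicted from features.
    def __init__(self, feat_channels: int, kernel_size: int = 3):
        super().__init__()
        self.k = kernel_size
        # One fusion weight per neighbor, predicted from the scene depth features.
        self.weight_head = nn.Conv2d(feat_channels, kernel_size * kernel_size, 3, padding=1)

    def forward(self, depth: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # depth: (B, 1, H, W) initial per-pixel depth; feats: (B, C, H, W) encoder features.
        b, _, h, w = depth.shape
        # Collect the k*k neighboring depths of every pixel (stride 1, same padding).
        neighbors = F.unfold(depth, kernel_size=self.k, padding=self.k // 2)
        neighbors = neighbors.view(b, self.k * self.k, h, w)
        # Per-pixel softmax weights over the neighborhood.
        weights = torch.softmax(self.weight_head(feats), dim=1)
        # Fused depth is the weighted sum of neighboring depths.
        return (weights * neighbors).sum(dim=1, keepdim=True)

The knowledge distillation mechanism can be summarized in the same hedged spirit as a student objective that adds, to the usual self-supervised photometric loss, an L1 term pulling the lightweight (student) encoder's depth toward the frozen large (teacher) encoder's prediction; the loss form and the weight alpha below are assumptions.

def distillation_loss(student_depth, teacher_depth, photometric_loss, alpha=0.5):
    # Hypothetical student objective: photometric self-supervision plus an L1
    # term matching the student's depth to the detached teacher prediction.
    distill = F.l1_loss(student_depth, teacher_depth.detach())
    return photometric_loss + alpha * distill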
Source journal
Image and Vision Computing (Engineering Technology, Engineering: Electrical & Electronic)
CiteScore: 8.50
Self-citation rate: 8.50%
Annual articles published: 143
Average review time: 7.8 months
Journal description: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.