Color and Geometric Contrastive Learning Based Intra-Frame Supervision for Self-Supervised Monocular Depth Estimation

IF 3.2 | Zone 2 (Engineering & Technology) | JCR Q2: ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Signal Processing Letters Pub Date : 2024-10-14 DOI:10.1109/LSP.2024.3480032

Yanbo Gao;Xianye Wu;Shuai Li;Xun Cai;Chuankun Li

IEEE Signal Processing Letters, vol. 31, pp. 2940-2944. https://ieeexplore.ieee.org/document/10716467/

Citations: 0

Abstract

In recent years, self-supervised monocular depth estimation has become popular because it estimates depth without ground-truth depth labels. Instead, it relies on inter-frame supervision: depth-based view synthesis reconstructs temporally adjacent frames, which indirectly supervises the generated depth. However, such supervision weakens depth estimation in temporally incoherent regions containing small changes among consecutive frames. To overcome this problem, we propose a color and geometric contrastive learning based intra-frame supervision framework to enhance self-supervised monocular depth estimation. Color-contrastive learning guides the network to learn color-invariant features, since color information is irrelevant to depth. To improve the local details of the learned features, pixel-level contrastive learning is further used. Because depth estimation, as a pixel-level task, is sensitive to geometric transformations, geometric-contrastive learning is developed using an inverse geometric transformation to learn features that are equivariant to geometric data augmentation. A local plane guidance (LPG) layer with contrastive learning is further used to decompose the geometric information and enhance the geometric contrastive learning. Experiments demonstrate that the proposed method achieves the best results among state-of-the-art methods on all tested quality metrics, with the largest improvements being SqRel reductions of 22.8% over the baseline Monodepth2 and 3.2% over MonoViT.
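The ideas in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The abstract does not state the loss form or the augmentations, so this sketch assumes an InfoNCE-style pixel-level loss, a horizontal flip as the geometric augmentation, and a temperature of 0.1. All function names are hypothetical. `sq_rel` follows the commonly used definition of the squared relative error, the mean of (pred - gt)^2 / gt over valid pixels.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def pixel_info_nce(feats_a, feats_b, temperature=0.1):
    """Assumed InfoNCE-style pixel-level contrastive loss.

    feats_a[i] and feats_b[i] are the features of pixel i in two views
    (for example, the original image and a color-augmented copy).
    Pixel i in view A is pulled toward pixel i in view B and pushed
    away from every other pixel in view B.
    """
    n = len(feats_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(feats_a[i], feats_b[j]) / temperature for j in range(n)]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / n


def hflip(feature_map):
    """Horizontally flip an H x W grid of per-pixel feature vectors."""
    return [row[::-1] for row in feature_map]


def flatten(feature_map):
    """Turn an H x W grid of feature vectors into a flat list of pixels."""
    return [vec for row in feature_map for vec in row]


def geometric_contrastive_loss(feats_orig, feats_from_flipped, temperature=0.1):
    """Contrast original features with features of a flipped input.

    The inverse transform (a second flip, assumed here) realigns the
    pixels first, so the loss rewards features that are equivariant to
    the augmentation.
    """
    realigned = hflip(feats_from_flipped)
    return pixel_info_nce(flatten(feats_orig), flatten(realigned), temperature)


def sq_rel(pred, gt):
    """Squared relative error: mean of (pred - gt)^2 / gt."""
    return sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / len(gt)


def relative_reduction(baseline, ours):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - ours) / baseline
```

For color-contrastive learning, `pixel_info_nce` would be called on features of an image and its color-augmented copy, where pixel positions already coincide. For the geometric case, `geometric_contrastive_loss` applies the inverse flip before comparing. `relative_reduction` shows how a figure such as the reported 22.8% SqRel reduction is computed from two error values. The inputs used here are toy values, not results from the paper.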


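Source journal

IEEE Signal Processing Letters (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 7.40
Self-citation rate: 12.80%
Articles published: 339
Review time: 2.8 months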

About the journal: The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language and audio processing. Papers published in the Letters can be presented, within one year of their appearance, at signal processing conferences such as ICASSP, GlobalSIP and ICIP, and at several workshops organized by the Signal Processing Society.