Geometry-Aware Self-Supervised Indoor 360$^{\circ }$ Depth Estimation via Asymmetric Dual-Domain Collaborative Learning

Impact Factor: 9.7 | CAS Tier 1, Computer Science | JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS
Xu Wang;Ziyan He;Qiudan Zhang;You Yang;Tiesong Zhao;Jianmin Jiang
{"title":"Geometry-Aware Self-Supervised Indoor 360$^{\\circ }$ Depth Estimation via Asymmetric Dual-Domain Collaborative Learning","authors":"Xu Wang;Ziyan He;Qiudan Zhang;You Yang;Tiesong Zhao;Jianmin Jiang","doi":"10.1109/TMM.2025.3535340","DOIUrl":null,"url":null,"abstract":"Being able to estimate monocular depth for spherical panoramas is of fundamental importance in 3D scene perception. However, spherical distortion severely limits the effectiveness of vanilla convolutions. To push the envelope of accuracy, recent approaches attempt to utilize Tangent projection (TP) to estimate the depth of <inline-formula><tex-math>$360 ^{\\circ }$</tex-math></inline-formula> images. Yet, these methods still suffer from discrepancies and inconsistencies among patch-wise tangent images, as well as the lack of accurate ground truth depth maps under a supervised fashion. In this paper, we propose a geometry-aware self-supervised <inline-formula><tex-math>$360 ^{\\circ }$</tex-math></inline-formula> image depth estimation methodology that explores the complementary advantages of TP and Equirectangular projection (ERP) by an asymmetric dual-domain collaborative learning strategy. Especially, we first develop a lightweight asymmetric dual-domain depth estimation network, which enables to aggregate depth-related features from a single TP domain, and then produce depth distributions of the TP and ERP domains via collaborative learning. This effectively mitigates stitching artifacts and preserves fine details in depth inference without overspending model parameters. In addition, a frequent-spatial feature concentration module is devised to simultaneously capture non-local Fourier features and local spatial features, such that facilitating the efficient exploration of monocular depth cues. Moreover, we introduce a geometric structural alignment module to further improve geometric structural consistency among tangent images. Extensive experiments illustrate that our designed approach outperforms existing self-supervised <inline-formula><tex-math>$360 ^{\\circ }$</tex-math></inline-formula> depth estimation methods on three publicly available benchmark datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3224-3237"},"PeriodicalIF":9.7000,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10855624/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Estimating monocular depth for spherical panoramas is of fundamental importance in 3D scene perception. However, spherical distortion severely limits the effectiveness of vanilla convolutions. To push the envelope of accuracy, recent approaches utilize tangent projection (TP) to estimate the depth of $360^{\circ}$ images. Yet, these methods still suffer from discrepancies and inconsistencies among patch-wise tangent images, as well as from the scarcity of accurate ground-truth depth maps required for supervised training. In this paper, we propose a geometry-aware self-supervised $360^{\circ}$ image depth estimation methodology that exploits the complementary advantages of TP and equirectangular projection (ERP) through an asymmetric dual-domain collaborative learning strategy. Specifically, we first develop a lightweight asymmetric dual-domain depth estimation network, which aggregates depth-related features from a single TP domain and then produces depth distributions for both the TP and ERP domains via collaborative learning. This effectively mitigates stitching artifacts and preserves fine details in depth inference without inflating the parameter budget. In addition, a frequency-spatial feature concentration module is devised to simultaneously capture non-local Fourier features and local spatial features, thereby facilitating efficient exploitation of monocular depth cues. Moreover, we introduce a geometric structural alignment module to further improve geometric structural consistency among tangent images. Extensive experiments demonstrate that our approach outperforms existing self-supervised $360^{\circ}$ depth estimation methods on three publicly available benchmark datasets.
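The abstract leans on tangent projection (TP) as the distortion-aware alternative to processing the equirectangular (ERP) image directly. To make that geometry concrete, the sketch below builds the standard inverse gnomonic projection sampling grid that maps tangent-plane coordinates back to ERP coordinates, so a TP patch can be cut from a panorama with bilinear sampling. This is a minimal PyTorch illustration of the general TP/ERP relationship, not the authors' implementation; the function name, patch resolution, and field of view are assumptions.

```python
import math
import torch
import torch.nn.functional as F

def tangent_sampling_grid(h_t, w_t, fov, lon0, lat0):
    """Normalized grid that extracts a gnomonic (tangent-plane) patch from an
    equirectangular image via F.grid_sample. Angles are in radians; row 0 of
    the ERP image is assumed to lie at latitude +pi/2 (image top)."""
    half = math.tan(fov / 2.0)  # half-extent of the tangent plane
    ys = torch.linspace(-half, half, h_t)
    xs = torch.linspace(-half, half, w_t)
    y, x = torch.meshgrid(ys, xs, indexing="ij")

    # Inverse gnomonic projection: plane (x, y) -> sphere (lon, lat).
    rho = torch.sqrt(x * x + y * y).clamp(min=1e-8)
    c = torch.atan(rho)
    sin_c, cos_c = torch.sin(c), torch.cos(c)
    lat = torch.asin(cos_c * math.sin(lat0)
                     + y * sin_c * math.cos(lat0) / rho)
    lon = lon0 + torch.atan2(
        x * sin_c,
        rho * math.cos(lat0) * cos_c - y * math.sin(lat0) * sin_c)
    lon = torch.remainder(lon + math.pi, 2 * math.pi) - math.pi  # wrap to [-pi, pi)

    # Sphere -> grid_sample coordinates in [-1, 1].
    u = lon / math.pi
    v = -2.0 * lat / math.pi
    return torch.stack([u, v], dim=-1).unsqueeze(0)  # (1, h_t, w_t, 2)

# Example: cut a 128x128 tangent patch from a 512x1024 ERP image.
erp = torch.rand(1, 3, 512, 1024)
grid = tangent_sampling_grid(128, 128, fov=math.radians(80), lon0=0.0, lat0=0.0)
patch = F.grid_sample(erp, grid, mode="bilinear", align_corners=True)
```

A TP-based pipeline would call this once per patch center (lon0, lat0) to cover the sphere with overlapping tangent images; the discrepancies among such patches are exactly what the paper's collaborative learning and geometric structural alignment modules are designed to reconcile.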
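The frequency-spatial feature concentration module is described only at a high level: it jointly captures non-local Fourier features and local spatial features. A common way to realize that idea (in the spirit of fast Fourier convolution) is a two-branch block in which pointwise convolutions applied in the Fourier domain provide a global receptive field while an ordinary convolution keeps local detail. The following is a hedged sketch under that assumption; all layer widths and the fusion scheme are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FreqSpatialBlock(nn.Module):
    """Toy two-branch block: a Fourier branch for non-local context and a
    convolutional branch for local detail. Sizes are illustrative assumptions."""

    def __init__(self, channels):
        super().__init__()
        # Spectral branch mixes real/imaginary parts jointly (2x channels).
        self.spectral = nn.Sequential(
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=1))
        # Spatial branch: an ordinary local convolution.
        self.local = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Non-local path: a 1x1 conv in the Fourier domain acts globally,
        # since every frequency bin depends on all spatial positions.
        spec = torch.fft.rfft2(x, norm="ortho")          # complex (b, c, h, w//2+1)
        spec = torch.cat([spec.real, spec.imag], dim=1)  # (b, 2c, h, w//2+1)
        spec = self.spectral(spec)
        real, imag = spec.chunk(2, dim=1)
        glob = torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")
        # Fuse the global (Fourier) and local (spatial) features.
        return self.fuse(torch.cat([glob, self.local(x)], dim=1))

# Quick shape check.
block = FreqSpatialBlock(32)
out = block(torch.rand(2, 32, 64, 64))  # -> (2, 32, 64, 64)
```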
Source Journal
IEEE Transactions on Multimedia (Engineering & Technology - Telecommunications)
CiteScore: 11.70
Self-citation rate: 11.00%
Annual publications: 576
Review time: 5.5 months
Journal introduction: The IEEE Transactions on Multimedia covers diverse aspects of multimedia technology and applications, including circuits, networking, signal processing, systems, software, and systems integration. Its scope aligns with the Fields of Interest of the sponsors, ensuring comprehensive coverage of multimedia research.