Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-Shot Metric Depth and Surface Normal Estimation

IF 18.6

IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2024-08-16. DOI: 10.1109/TPAMI.2024.3444912

Mu Hu;Wei Yin;Chi Zhang;Zhipeng Cai;Xiaoxiao Long;Hao Chen;Kaixuan Wang;Gang Yu;Chunhua Shen;Shaojie Shen

Volume 46, Issue 12, pp. 10579-10596. Publication type: Journal Article. https://ieeexplore.ieee.org/document/10638254/


Abstract

We introduce Metric3D v2, a geometric foundation model designed for zero-shot metric depth and surface normal estimation from single images, critical for accurate 3D recovery. Depth and normal estimation, though complementary, present distinct challenges. State-of-the-art monocular depth methods achieve zero-shot generalization through affine-invariant depths, but fail to recover real-world metric scale. Conversely, current normal estimation techniques struggle with zero-shot performance due to insufficient labeled data. We propose targeted solutions for both metric depth and normal estimation. For metric depth, we present a canonical camera space transformation module that resolves metric ambiguity across various camera models and large-scale datasets, and that can be easily integrated into existing monocular models. For surface normal estimation, we introduce a joint depth-normal optimization module that leverages diverse data from metric depth, allowing normal estimators to improve beyond traditional labels. Our model, trained on over 16 million images from thousands of camera models with varied annotations, excels in zero-shot generalization to new camera settings. As shown in Fig. 1, it ranks first in multiple zero-shot and standard benchmarks for metric depth and surface normal prediction. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. Our model also relieves the scale drift issues of monocular SLAM (Fig. 3), leading to high-quality metric-scale dense mapping. Such applications highlight the versatility of Metric3D v2 models as geometric foundation models.
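The abstract names a canonical camera space transformation but does not give its formula. The sketch below illustrates one plausible form of the idea under stated assumptions: a pinhole camera, a transformation that rescales depth labels by the ratio of a canonical focal length to the real focal length, and a hypothetical canonical focal length of 1000 px. The function names and the constant are illustrative and are not taken from the paper or its code.

```python
import numpy as np

# Hypothetical canonical focal length in pixels (an assumption for illustration).
CANONICAL_FOCAL = 1000.0


def to_canonical_depth(depth_m, focal_px, canonical_focal=CANONICAL_FOCAL):
    """Rescale metric depth as if the image came from the canonical camera."""
    depth = np.asarray(depth_m, dtype=np.float64)
    return depth * (canonical_focal / focal_px)


def from_canonical_depth(depth_canonical, focal_px, canonical_focal=CANONICAL_FOCAL):
    """Invert the rescaling to recover metric depth for the real camera."""
    depth = np.asarray(depth_canonical, dtype=np.float64)
    return depth * (focal_px / canonical_focal)


def projected_height_px(object_height_m, depth_m, focal_px):
    """Pinhole projection: image height of a fronto-parallel object."""
    return focal_px * object_height_m / depth_m


# A 2 m object seen by two cameras looks identical in the image ...
h_a = projected_height_px(2.0, depth_m=10.0, focal_px=500.0)
h_b = projected_height_px(2.0, depth_m=20.0, focal_px=1000.0)

# ... yet its metric depth differs. In canonical space both labels agree.
d_a = to_canonical_depth(10.0, focal_px=500.0)
d_b = to_canonical_depth(20.0, focal_px=1000.0)
print(h_a, h_b, d_a, d_b)
```

The two cameras produce the same image appearance for different true depths, which is the metric ambiguity the abstract refers to. After the rescaling, one appearance maps to one training target, and a prediction can be converted back to metric depth with the real focal length.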



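We introduce Metric3D v2, a geometric foundation model for zero-shot metric depth and surface normal estimation from a single image, which is crucial for metric 3D recovery. Although depth and normals are geometrically related and highly complementary, they present distinct challenges. State-of-the-art (SoTA) monocular depth methods achieve zero-shot generalization by learning affine-invariant depth, which cannot recover real-world metric scale. Meanwhile, SoTA normal estimation methods have limited zero-shot performance because large-scale labeled data is lacking. To address these issues, we propose solutions for both metric depth estimation and surface normal estimation. For metric depth estimation, we show that the key to a zero-shot single-view model lies in resolving the metric ambiguity that arises from varied camera models and large-scale data training. We propose a canonical camera space transformation module that explicitly addresses this ambiguity and can be plugged into existing monocular models with little effort. For surface normal estimation, we propose a joint depth-normal optimization module that distills diverse data knowledge from metric depth, enabling normal estimators to learn beyond normal labels. Equipped with these modules, our depth-normal models can be trained stably on over 16 million images from thousands of camera models with different types of annotations, resulting in zero-shot generalization to in-the-wild images with unseen camera settings. Our method currently ranks first on various zero-shot and non-zero-shot benchmarks for metric depth, affine-invariant depth, and surface normal prediction, as shown in Fig. 1. Notably, we surpass the recent MarigoldDepth and DepthAnything on various depth benchmarks, including NYUv2 and KITTI. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. The potential benefits extend to downstream tasks, which can be significantly improved by simply plugging in our model. For example, our model relieves the scale drift issues of monocular SLAM (Fig. 3), leading to high-quality metric-scale dense mapping. These applications highlight the versatility of Metric3D v2 models as geometric foundation models. Our project page is at https://JUGGHM.github.io/Metric3Dv2.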
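The abstract states that depth and surface normals are geometrically related, which is what lets depth data inform normal estimation. As a minimal sketch of that relationship only (not the paper's joint optimization module), the code below derives normals from a metric depth map by back-projecting pixels with assumed pinhole intrinsics and taking cross products of neighboring 3D differences. The function name and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical.

```python
import numpy as np


def normals_from_depth(depth, fx, fy, cx, cy):
    """Estimate unit surface normals from a metric depth map (pinhole model).

    Returns an array of shape (H-1, W-1, 3) with normals facing the camera.
    """
    depth = np.asarray(depth, dtype=np.float64)
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    # Back-project every pixel to a 3D point in camera coordinates.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1)

    # Forward differences along image columns and rows.
    du = points[:, 1:, :] - points[:, :-1, :]
    dv = points[1:, :, :] - points[:-1, :, :]

    # The normal is perpendicular to both local tangent vectors.
    n = np.cross(du[:-1], dv[:, :-1])
    n = n / np.clip(np.linalg.norm(n, axis=-1, keepdims=True), 1e-12, None)

    # Orient normals toward the camera (negative z in camera coordinates).
    flip = n[..., 2] > 0
    n[flip] *= -1
    return n


# A fronto-parallel wall 2 m away should face straight back at the camera.
flat = np.full((4, 5), 2.0)
flat_normals = normals_from_depth(flat, fx=500.0, fy=500.0, cx=2.0, cy=1.5)
print(flat_normals[0, 0])
```

Normals obtained this way are noisy at depth discontinuities, so a sketch like this is better read as an illustration of the geometric link than as a source of training labels.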

