Self-Supervised Multi-Camera Collaborative Depth Prediction With Latent Diffusion Models

IF 7.9 | Zone 1 (Engineering & Technology) | Q1 ENGINEERING, CIVIL

IEEE Transactions on Intelligent Transportation Systems Pub Date : 2025-06-02 DOI:10.1109/TITS.2025.3571027

Jialei Xu;Xianming Liu;Yuanchao Bai;Junjun Jiang;Xiangyang Ji

Volume 26, Issue 7, pp. 9609-9624. Journal article, not open access. URL: https://ieeexplore.ieee.org/document/11021548/

Citation count: 0

Abstract

Depth map estimation from images is a crucial task in self-driving applications. Existing methods fall into two groups: multi-view stereo and monocular depth estimation. The former requires cameras with large overlapping areas and a sufficient baseline between them, while the latter, which processes each image independently, can hardly guarantee structural consistency between cameras. In this paper, we propose a novel self-supervised multi-camera collaborative depth prediction method with latent diffusion models, which does not require large overlapping areas yet maintains structural consistency between cameras. Specifically, we introduce MCDP, a new generative foundation model for estimating depth attributes in multi-camera systems. We formulate depth estimation as a weighted combination of depth bases, in which the weights are updated iteratively by a recurrent refinement strategy. During the iterative update, the depth estimates are compared across cameras, and the information from overlapping areas is propagated to the whole depth maps through the basis formulation in the diffusion process. We integrate a GRU-based Weight Net into the diffusion process, so that the refined hidden state serves as a conditional input that controls the next iterative denoising step. Furthermore, the proposed depth consistency loss enforces structural consistency across cameras, even in regions with minimal overlap. Experimental results on the DDAD, nuScenes, Cityscapes, and Waymo Open datasets demonstrate the superior performance of our method and show that it greatly helps the downstream task.
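
The basis formulation in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes one scalar weight per basis, uses hand-made bases (a constant plane and two ramps), and replaces the learned GRU-based Weight Net and the diffusion process with a damped least-squares update. The names `combine_bases`, `refine_weights`, `overlap_mask`, and `damping` are hypothetical. The point it shows is that weights fitted only on an overlap region still determine the depth everywhere, because the bases span the whole image.

```python
import numpy as np

def combine_bases(bases: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Depth map as a weighted sum of K depth bases.

    bases: (K, H, W), weights: (K,). Returns an (H, W) depth map.
    """
    return np.tensordot(weights, bases, axes=1)

def refine_weights(bases, weights, reference, overlap_mask, steps=30, damping=0.5):
    """Iteratively update the weights using only pixels inside overlap_mask.

    ASSUMPTION: a damped least-squares step stands in for the learned
    recurrent (GRU-based) update described in the abstract.
    """
    A = bases[:, overlap_mask].T              # (N, K): basis values in the overlap
    w = np.asarray(weights, dtype=float).copy()
    for _ in range(steps):
        residual = reference[overlap_mask] - A @ w
        delta, *_ = np.linalg.lstsq(A, residual, rcond=None)
        w += damping * delta                  # move part of the way to the fit
    return w

# Toy setup: three hand-made bases on a 4x6 image (illustrative only).
H, W = 4, 6
ys, xs = np.mgrid[0:H, 0:W]
bases = np.stack([np.ones((H, W)), xs / (W - 1), ys / (H - 1)])

# Hypothetical reference depth, e.g. what a neighbouring camera predicts.
reference = combine_bases(bases, np.array([5.0, 20.0, -2.0]))

# Only the two left-most columns are treated as the overlap region.
overlap_mask = np.zeros((H, W), dtype=bool)
overlap_mask[:, :2] = True

weights = refine_weights(bases, np.zeros(3), reference, overlap_mask)
depth = combine_bases(bases, weights)
print(np.abs(depth - reference).max())  # small across the full image
```

In this toy case the overlap covers a third of the pixels, yet the refined depth agrees with the reference over the full image. With per-pixel depth regression and no shared bases, pixels outside the overlap would receive no signal from the comparison.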


Source journal

IEEE Transactions on Intelligent Transportation Systems (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore

14.80

Self-citation rate

12.90%

Articles published

1872

Review time

7.5 months

Journal description: The journal covers the theoretical, experimental and operational aspects of electrical and electronics engineering and information technologies as applied to Intelligent Transportation Systems (ITS). Intelligent Transportation Systems are defined as those systems utilizing synergistic technologies and systems engineering concepts to develop and improve transportation systems of all kinds. The scope of this interdisciplinary activity includes the promotion, consolidation and coordination of ITS technical activities among IEEE entities, and providing a focus for cooperative activities, both internally and externally.