Jialei Xu;Xianming Liu;Yuanchao Bai;Junjun Jiang;Xiangyang Ji
Title: Self-Supervised Multi-Camera Collaborative Depth Prediction With Latent Diffusion Models
Journal: IEEE Transactions on Intelligent Transportation Systems, vol. 26, no. 7, pp. 9609-9624
Published: 2025-06-02
DOI: 10.1109/TITS.2025.3571027
URL: https://ieeexplore.ieee.org/document/11021548/
Impact factor: 7.9 (JCR Q1, Engineering, Civil)
Citations: 0
Abstract
Depth map estimation from images is a crucial task in self-driving applications. Existing methods fall into two groups: multi-view stereo and monocular depth estimation. The former requires cameras to have large overlapping areas and a sufficient baseline between them, while the latter, which processes each image independently, can hardly guarantee structural consistency between cameras. In this paper, we propose a novel self-supervised multi-camera collaborative depth prediction method with latent diffusion models, which does not require large overlapping areas yet maintains structural consistency between cameras. Specifically, we introduce MCDP, a new generative foundation model for estimating depth attributes in multi-camera setups. We formulate depth estimation as a weighted combination of depth bases, in which the weights are updated iteratively by a recurrent refinement strategy. During the iterative update, the depth estimates are compared across cameras, and the information from overlapping areas is propagated to the whole depth maps through the basis formulation in the diffusion process. We integrate a GRU-based Weight Net into the diffusion process, allowing the refined hidden state to serve as a conditional input that accurately controls the next iterative denoising step. Furthermore, by incorporating the proposed depth consistency loss, we ensure structural consistency across cameras, even in regions with minimal overlap. Experimental results on the DDAD, NuScenes, Cityscapes, and Waymo Open datasets demonstrate the superior performance of our method and show that it benefits the downstream task.
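The basis formulation in the abstract can be illustrated with a toy sketch: a depth map is a weighted sum of fixed depth bases, and the weights are refined iteratively using only pixels where a reference depth is available (standing in for the overlap with a neighbouring camera). Because the weights are shared by every pixel, fitting them on the overlap also changes the depth predicted elsewhere. Everything here is an assumption made for illustration: the function names, the tiny 2x2 maps, and above all the plain gradient-descent update, which is a simple substitute and not the paper's GRU-based Weight Net, diffusion process, or depth consistency loss.

```python
from typing import List, Tuple

DepthMap = List[List[float]]            # H x W grid of depth values
OverlapPixel = Tuple[int, int, float]   # (row, col, reference depth), hypothetical format


def combine_bases(bases: List[DepthMap], weights: List[float]) -> DepthMap:
    """Return the per-pixel weighted sum of the depth bases."""
    if len(bases) != len(weights):
        raise ValueError("need exactly one weight per basis")
    height, width = len(bases[0]), len(bases[0][0])
    return [
        [sum(w * b[r][c] for w, b in zip(weights, bases)) for c in range(width)]
        for r in range(height)
    ]


def refine_weights(
    bases: List[DepthMap],
    weights: List[float],
    overlap: List[OverlapPixel],
    steps: int = 500,
    lr: float = 0.3,
) -> List[float]:
    """Iteratively update the weights using overlap pixels only.

    Assumption: gradient descent on squared error replaces the learned
    recurrent update described in the abstract.
    """
    weights = list(weights)
    for _ in range(steps):
        grads = [0.0] * len(weights)
        for row, col, reference in overlap:
            predicted = sum(w * b[row][col] for w, b in zip(weights, bases))
            residual = predicted - reference
            for k, basis in enumerate(bases):
                grads[k] += residual * basis[row][col]
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


if __name__ == "__main__":
    # Two toy bases: a constant plane and a ramp.
    toy_bases = [
        [[1.0, 1.0], [1.0, 1.0]],
        [[0.0, 1.0], [2.0, 3.0]],
    ]
    # Reference depth is known only on the top row (the "overlap").
    toy_overlap = [(0, 0, 10.0), (0, 1, 12.0)]
    refined = refine_weights(toy_bases, [0.0, 0.0], toy_overlap)
    for depth_row in combine_bases(toy_bases, refined):
        print(depth_row)
```

In this toy setup the bottom row of the depth map is never observed, yet its values are determined once the shared weights are fitted on the top row, which is the sense in which a basis formulation can carry overlap information to the whole map.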
About the journal:
The journal covers the theoretical, experimental, and operational aspects of electrical and electronics engineering and information technologies as applied to Intelligent Transportation Systems (ITS). Intelligent Transportation Systems are defined as those systems utilizing synergistic technologies and systems engineering concepts to develop and improve transportation systems of all kinds. The scope of this interdisciplinary activity includes the promotion, consolidation, and coordination of ITS technical activities among IEEE entities, and providing a focus for cooperative activities, both internally and externally.