Shuai Wang;Ting Yu;Shan Pan;Wei Chen;Zehua Wang;Victor C. M. Leung;Zijian Tian
{"title":"基于自监督单目深度估计的潜在目标嵌入","authors":"Shuai Wang;Ting Yu;Shan Pan;Wei Chen;Zehua Wang;Victor C. M. Leung;Zijian Tian","doi":"10.1109/TETCI.2025.3547851","DOIUrl":null,"url":null,"abstract":"Extracting 3D information from 2D images is highly significant, and self-supervised monocular depth estimation has demonstrated great potential in this field. However, existing methods primarily focus on estimating depth from immediate visual features, leading to severe foreground-background adhesion, which poses challenges for achieving precise depth estimation. In this paper, we propose a depth estimation method called LOEDepth, which can implicitly distinguish foreground objects from the background. In LOEDepth, a latent object embedding module is introduced, which leverages a set of learnable queries to generate latent object proposals from both immediate visual features extracted by the encoder and sparse object features derived through multi-scale deformable attention. These latent object proposals are utilized to perform soft classification on the decoded features to distinguish foreground objects from the background. Additionally, as depth boundaries do not always align with semantic boundaries, we propose a novel deep decoder to provide decoding features with rich spatial location retrieval and semantic information. Finally, two mask strategies are utilized to conceal pixels violating the scene's static assumption, so as to mitigate disruptions caused by abnormal pixels during self-supervised training. 
Experimental results on the KITTI and Make3D datasets demonstrate significant performance improvements and robust fine-grained scene depth estimation capabilities of the proposed method.","PeriodicalId":13135,"journal":{"name":"IEEE Transactions on Emerging Topics in Computational Intelligence","volume":"9 5","pages":"3548-3559"},"PeriodicalIF":5.3000,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Latent Object Embedding for Self-Supervised Monocular Depth Estimation\",\"authors\":\"Shuai Wang;Ting Yu;Shan Pan;Wei Chen;Zehua Wang;Victor C. M. Leung;Zijian Tian\",\"doi\":\"10.1109/TETCI.2025.3547851\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Extracting 3D information from 2D images is highly significant, and self-supervised monocular depth estimation has demonstrated great potential in this field. However, existing methods primarily focus on estimating depth from immediate visual features, leading to severe foreground-background adhesion, which poses challenges for achieving precise depth estimation. In this paper, we propose a depth estimation method called LOEDepth, which can implicitly distinguish foreground objects from the background. In LOEDepth, a latent object embedding module is introduced, which leverages a set of learnable queries to generate latent object proposals from both immediate visual features extracted by the encoder and sparse object features derived through multi-scale deformable attention. These latent object proposals are utilized to perform soft classification on the decoded features to distinguish foreground objects from the background. Additionally, as depth boundaries do not always align with semantic boundaries, we propose a novel deep decoder to provide decoding features with rich spatial location retrieval and semantic information. 
Finally, two mask strategies are utilized to conceal pixels violating the scene's static assumption, so as to mitigate disruptions caused by abnormal pixels during self-supervised training. Experimental results on the KITTI and Make3D datasets demonstrate significant performance improvements and robust fine-grained scene depth estimation capabilities of the proposed method.\",\"PeriodicalId\":13135,\"journal\":{\"name\":\"IEEE Transactions on Emerging Topics in Computational Intelligence\",\"volume\":\"9 5\",\"pages\":\"3548-3559\"},\"PeriodicalIF\":5.3000,\"publicationDate\":\"2025-03-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Emerging Topics in Computational Intelligence\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10930815/\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Emerging Topics in Computational Intelligence","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10930815/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Latent Object Embedding for Self-Supervised Monocular Depth Estimation
Extracting 3D information from 2D images is highly significant, and self-supervised monocular depth estimation has demonstrated great potential in this field. However, existing methods primarily focus on estimating depth from immediate visual features, leading to severe foreground-background adhesion, which poses challenges for achieving precise depth estimation. In this paper, we propose a depth estimation method called LOEDepth, which can implicitly distinguish foreground objects from the background. In LOEDepth, a latent object embedding module is introduced, which leverages a set of learnable queries to generate latent object proposals from both immediate visual features extracted by the encoder and sparse object features derived through multi-scale deformable attention. These latent object proposals are utilized to perform soft classification on the decoded features to distinguish foreground objects from the background. Additionally, as depth boundaries do not always align with semantic boundaries, we propose a novel deep decoder to provide decoding features with rich spatial location retrieval and semantic information. Finally, two mask strategies are utilized to conceal pixels violating the scene's static assumption, so as to mitigate disruptions caused by abnormal pixels during self-supervised training. Experimental results on the KITTI and Make3D datasets demonstrate significant performance improvements and robust fine-grained scene depth estimation capabilities of the proposed method.
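The abstract's two key mechanisms can be illustrated in miniature. The sketch below is not the authors' implementation: it shows (a) soft classification of decoded pixel features against a set of learnable object queries via dot-product similarity and a softmax, standing in for the latent object embedding module, and (b) a Monodepth2-style auto-mask that keeps only pixels whose reprojection loss beats the identity (no-motion) loss, one common way to "conceal pixels violating the scene's static assumption." All function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_classify(pixel_feats, object_queries):
    """Softly assign each decoded pixel feature to one of K latent
    object proposals via dot-product similarity; each row of the
    result is a distribution over proposals (foreground/background)."""
    logits = pixel_feats @ object_queries.T   # (N, K) similarity scores
    return softmax(logits, axis=-1)           # rows sum to 1

def static_scene_mask(reproj_loss, identity_loss):
    """Auto-mask in the style of Monodepth2: keep only pixels whose
    reprojection loss is lower than the identity reprojection loss,
    i.e. pixels consistent with a static scene."""
    return (reproj_loss < identity_loss).astype(np.float32)

rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 8))     # 6 pixels, 8-dim decoded features
queries = rng.normal(size=(3, 8))   # 3 learnable latent object queries
assign = soft_classify(feats, queries)              # (6, 3) soft labels
mask = static_scene_mask(rng.random((2, 2)),        # binary per-pixel mask
                         rng.random((2, 2)))
```

In the paper the queries are trained end-to-end and attend to multi-scale deformable-attention features; the dot-product-plus-softmax step here only conveys the "soft classification" idea.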
Journal description:
The IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI) publishes original articles on emerging aspects of computational intelligence, including theory, applications, and surveys.
TETCI is an electronic-only publication. TETCI publishes six issues per year.
Authors are encouraged to submit manuscripts in any emerging topic in computational intelligence, especially nature-inspired computing topics not covered by other IEEE Computational Intelligence Society journals. A few such illustrative examples are glial cell networks, computational neuroscience, Brain Computer Interface, ambient intelligence, non-fuzzy computing with words, artificial life, cultural learning, artificial endocrine networks, social reasoning, artificial hormone networks, computational intelligence for the IoT and Smart-X technologies.