Shuai Wang;Ting Yu;Shan Pan;Wei Chen;Zehua Wang;Victor C. M. Leung;Zijian Tian
{"title":"基于自监督单目深度估计的潜在目标嵌入","authors":"Shuai Wang;Ting Yu;Shan Pan;Wei Chen;Zehua Wang;Victor C. M. Leung;Zijian Tian","doi":"10.1109/TETCI.2025.3547851","DOIUrl":null,"url":null,"abstract":"Extracting 3D information from 2D images is highly significant, and self-supervised monocular depth estimation has demonstrated great potential in this field. However, existing methods primarily focus on estimating depth from immediate visual features, leading to severe foreground-background adhesion, which poses challenges for achieving precise depth estimation. In this paper, we propose a depth estimation method called LOEDepth, which can implicitly distinguish foreground objects from the background. In LOEDepth, a latent object embedding module is introduced, which leverages a set of learnable queries to generate latent object proposals from both immediate visual features extracted by the encoder and sparse object features derived through multi-scale deformable attention. These latent object proposals are utilized to perform soft classification on the decoded features to distinguish foreground objects from the background. Additionally, as depth boundaries do not always align with semantic boundaries, we propose a novel deep decoder to provide decoding features with rich spatial location retrieval and semantic information. Finally, two mask strategies are utilized to conceal pixels violating the scene's static assumption, so as to mitigate disruptions caused by abnormal pixels during self-supervised training. 
Experimental results on the KITTI and Make3D datasets demonstrate significant performance improvements and robust fine-grained scene depth estimation capabilities of the proposed method.","PeriodicalId":13135,"journal":{"name":"IEEE Transactions on Emerging Topics in Computational Intelligence","volume":"9 5","pages":"3548-3559"},"PeriodicalIF":5.3000,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Latent Object Embedding for Self-Supervised Monocular Depth Estimation\",\"authors\":\"Shuai Wang;Ting Yu;Shan Pan;Wei Chen;Zehua Wang;Victor C. M. Leung;Zijian Tian\",\"doi\":\"10.1109/TETCI.2025.3547851\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Extracting 3D information from 2D images is highly significant, and self-supervised monocular depth estimation has demonstrated great potential in this field. However, existing methods primarily focus on estimating depth from immediate visual features, leading to severe foreground-background adhesion, which poses challenges for achieving precise depth estimation. In this paper, we propose a depth estimation method called LOEDepth, which can implicitly distinguish foreground objects from the background. In LOEDepth, a latent object embedding module is introduced, which leverages a set of learnable queries to generate latent object proposals from both immediate visual features extracted by the encoder and sparse object features derived through multi-scale deformable attention. These latent object proposals are utilized to perform soft classification on the decoded features to distinguish foreground objects from the background. Additionally, as depth boundaries do not always align with semantic boundaries, we propose a novel deep decoder to provide decoding features with rich spatial location retrieval and semantic information. 
Finally, two mask strategies are utilized to conceal pixels violating the scene's static assumption, so as to mitigate disruptions caused by abnormal pixels during self-supervised training. Experimental results on the KITTI and Make3D datasets demonstrate significant performance improvements and robust fine-grained scene depth estimation capabilities of the proposed method.\",\"PeriodicalId\":13135,\"journal\":{\"name\":\"IEEE Transactions on Emerging Topics in Computational Intelligence\",\"volume\":\"9 5\",\"pages\":\"3548-3559\"},\"PeriodicalIF\":5.3000,\"publicationDate\":\"2025-03-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Emerging Topics in Computational Intelligence\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10930815/\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Emerging Topics in Computational Intelligence","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10930815/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Latent Object Embedding for Self-Supervised Monocular Depth Estimation
Extracting 3D information from 2D images is highly significant, and self-supervised monocular depth estimation has demonstrated great potential in this field. However, existing methods primarily focus on estimating depth from immediate visual features, leading to severe foreground-background adhesion, which poses challenges for achieving precise depth estimation. In this paper, we propose a depth estimation method called LOEDepth, which can implicitly distinguish foreground objects from the background. In LOEDepth, a latent object embedding module is introduced, which leverages a set of learnable queries to generate latent object proposals from both immediate visual features extracted by the encoder and sparse object features derived through multi-scale deformable attention. These latent object proposals are utilized to perform soft classification on the decoded features to distinguish foreground objects from the background. Additionally, as depth boundaries do not always align with semantic boundaries, we propose a novel deep decoder to provide decoding features with rich spatial location retrieval and semantic information. Finally, two mask strategies are utilized to conceal pixels violating the scene's static assumption, so as to mitigate disruptions caused by abnormal pixels during self-supervised training. Experimental results on the KITTI and Make3D datasets demonstrate significant performance improvements and robust fine-grained scene depth estimation capabilities of the proposed method.
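The abstract's two key mechanisms can be illustrated in miniature. The sketch below is not the authors' implementation: it shows (a) soft classification of decoded pixel features against a set of learnable object queries via dot-product similarity and a softmax, standing in for the latent object embedding module, and (b) a Monodepth2-style auto-mask that keeps only pixels whose reprojection loss beats the identity (no-motion) loss, one common way to "conceal pixels violating the scene's static assumption." All function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_classify(pixel_feats, object_queries):
    """Softly assign each decoded pixel feature to one of K latent
    object proposals via dot-product similarity; each row of the
    result is a distribution over proposals (foreground/background)."""
    logits = pixel_feats @ object_queries.T   # (N, K) similarity scores
    return softmax(logits, axis=-1)           # rows sum to 1

def static_scene_mask(reproj_loss, identity_loss):
    """Auto-mask in the style of Monodepth2: keep only pixels whose
    reprojection loss is lower than the identity reprojection loss,
    i.e. pixels consistent with a static scene."""
    return (reproj_loss < identity_loss).astype(np.float32)

rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 8))     # 6 pixels, 8-dim decoded features
queries = rng.normal(size=(3, 8))   # 3 learnable latent object queries
assign = soft_classify(feats, queries)              # (6, 3) soft labels
mask = static_scene_mask(rng.random((2, 2)),        # binary per-pixel mask
                         rng.random((2, 2)))
```

In the paper the queries are trained end-to-end and attend to multi-scale deformable-attention features; the dot-product-plus-softmax step here only conveys the "soft classification" idea.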
Journal description:
The IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI) publishes original articles on emerging aspects of computational intelligence, including theory, applications, and surveys.
TETCI is an electronic-only publication. TETCI publishes six issues per year.
Authors are encouraged to submit manuscripts in any emerging topic in computational intelligence, especially nature-inspired computing topics not covered by other IEEE Computational Intelligence Society journals. A few such illustrative examples are glial cell networks, computational neuroscience, Brain Computer Interface, ambient intelligence, non-fuzzy computing with words, artificial life, cultural learning, artificial endocrine networks, social reasoning, artificial hormone networks, computational intelligence for the IoT and Smart-X technologies.