Peicheng Shi, Xinlong Dong, Runshuai Ge, Zhiqiang Liu, Aixi Yang
Dp-M3D: Monocular 3D object detection algorithm with depth perception capability
Knowledge-Based Systems, vol. 318, Article 113539. Published 2025-04-11. DOI: 10.1016/j.knosys.2025.113539
https://www.sciencedirect.com/science/article/pii/S0950705125005854
Citations: 0
Abstract
Monocular 3D object detection is limited by scarce depth information and weak depth perception. To address this, we introduce Dp-M3D, a novel monocular 3D object detection algorithm equipped with depth perception capabilities. To model long-range feature dependencies when fusing depth maps with image features, we introduce a Transformer Feature Fusion Encoder (TFFEn). TFFEn integrates depth and image features, enabling more comprehensive long-range feature modeling; this enhances depth perception and ultimately improves 3D detection accuracy. To improve the detection of truncated objects at the edges of an image, we propose a Feature Enhancement method based on Deformable Convolution (FEDC). FEDC uses depth confidence guidance to determine the deformation offset of the 3D bounding box, aligning features more effectively and improving depth perception. Furthermore, to address the anchor box ranking problem, in which candidate boxes with accurate depth predictions but low classification confidence are suppressed, we propose a Depth-perception Non-Maximum Suppression (Dp-NMS) algorithm. Dp-NMS ranks candidate boxes by the product of classification confidence and depth confidence, so that the most suitable detection box is retained. Experimental results on the challenging KITTI 3D object detection dataset show that the proposed method achieves AP_3D scores of 23.41%, 13.65%, and 12.91% in the easy, moderate, and hard categories, respectively. Our approach outperforms state-of-the-art monocular 3D object detection algorithms based on images alone and on image-depth map fusion, with particularly significant improvements in detecting edge-truncated objects.
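The ranking idea behind Dp-NMS can be illustrated with a minimal sketch. The abstract states only that candidates are ranked by the product of classification confidence and depth confidence; everything else here is an assumption made for illustration and is not taken from the paper: the greedy suppression loop, the axis-aligned 2D boxes in `[x1, y1, x2, y2]` form, the IoU threshold of 0.5, and the function names `iou_one_to_many` and `depth_aware_nms`.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """IoU between one [x1, y1, x2, y2] box and an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def depth_aware_nms(boxes, cls_conf, depth_conf, iou_thresh=0.5):
    """Greedy NMS ranked by cls_conf * depth_conf (illustrative, hypothetical).

    Returns the indices of the kept boxes, highest combined score first.
    """
    boxes = np.asarray(boxes, dtype=float)
    # Combined ranking score: the product named in the abstract.
    scores = np.asarray(cls_conf, dtype=float) * np.asarray(depth_conf, dtype=float)
    order = np.argsort(-scores)  # indices sorted by descending combined score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        # Drop remaining candidates that overlap the kept box too strongly.
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_thresh]
    return keep


if __name__ == "__main__":
    boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
    cls_conf = [0.9, 0.6, 0.8]    # box 0 wins on classification alone
    depth_conf = [0.3, 0.9, 0.7]  # but box 1 has the more reliable depth
    print(depth_aware_nms(boxes, cls_conf, depth_conf))
```

In this toy input, boxes 0 and 1 overlap heavily. Ranking by classification confidence alone would keep box 0 and suppress box 1; ranking by the product keeps box 1 instead, which is the behaviour the abstract describes for candidates with accurate depth but low classification confidence.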
About the journal:
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on knowledge-based and other artificial intelligence techniques-based systems. The journal aims to support human prediction and decision-making through data science and computation techniques, provide a balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.