Title: A monocular three-dimensional object detection model based on uncertainty-guided depth combination for autonomous driving
Authors: Xin Zhou, Xiaolong Xu
Journal: Computers & Electrical Engineering, vol. 120, Article 109864 (published 2024-11-21); JCR Q1, Computer Science, Hardware & Architecture
DOI: 10.1016/j.compeleceng.2024.109864
URL: https://www.sciencedirect.com/science/article/pii/S0045790624007912
Citations: 0
Abstract
Three-Dimensional (3D) object detection is a crucial task for enhancing safety and efficiency in autonomous driving. However, estimating depth from monocular images remains challenging. Most existing monocular 3D object detection methods rely on additional auxiliary data sources to compensate for the lack of spatial information in monocular images. Nevertheless, these methods bring substantial computational overhead and time-consuming preprocessing steps. To address this issue, we propose a novel depth estimation method for monocular images that does not rely on any auxiliary information. Leveraging both the texture and geometric cues of detected objects, our method generates two depth estimates for each object based on the extracted Region of Interest (RoI) features: a direct depth estimate and a height-based depth estimate with uncertainty modeling. Our model dynamically assigns weights to these depth estimates based on their respective uncertainties and combines them to obtain the final depth. During the training process, the model assigns higher weights to depth branches with higher uncertainties, as these estimates exhibit greater tolerance to errors. As the combined depth network introduces increased complexity, we utilize Group Normalization (GN) to better capture spatial information in the prediction branch outputs. Furthermore, we leverage the Two-Dimensional (2D) information of objects to predict the residual of the 2D center after downsampling, aiding the regression of the 3D center. On the KITTI benchmark, our model achieves an average precision (AP) of 16.65% and 23.19% on 3D and bird's-eye view (BEV) detection for the moderate category, surpassing the state-of-the-art (SOTA) models in each category.
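The abstract describes two depth cues per object (a direct RoI-based estimate and a height-based geometric estimate) combined by uncertainty-derived weights. A minimal sketch of these two ideas follows, under stated assumptions: the height-based cue uses standard pinhole geometry (z = f · H_3D / h_2D), and the fusion here uses inverse-uncertainty (softmax-style) weighting, which is one common scheme in the monocular 3D detection literature; the paper's exact weighting rule, branch architecture, and function names (`height_depth`, `fuse_depths`, `log_sigmas`) are not specified in the abstract and are hypothetical.

```python
import math


def height_depth(focal_px: float, h3d_m: float, h2d_px: float) -> float:
    """Geometric depth cue from pinhole projection: z = f * H_3D / h_2D.

    focal_px: camera focal length in pixels
    h3d_m:    predicted 3D object height in meters
    h2d_px:   observed 2D bounding-box height in pixels
    """
    return focal_px * h3d_m / h2d_px


def fuse_depths(depths: list[float], log_sigmas: list[float]) -> float:
    """Combine per-branch depth estimates by inverse-uncertainty weights.

    Each branch predicts a depth and a log standard deviation s_i
    (sigma_i = exp(s_i)); weights w_i ∝ 1/sigma_i are normalized to
    sum to 1, so more certain branches dominate the final depth.
    This is an illustrative weighting, not the paper's exact rule.
    """
    raw = [math.exp(-s) for s in log_sigmas]  # 1 / sigma_i
    total = sum(raw)
    return sum((w / total) * d for w, d in zip(raw, depths))


# Example: a car of height 1.5 m seen 54 px tall with a 720 px focal length
z_geo = height_depth(720.0, 1.5, 54.0)          # -> 20.0 m
# Fuse with a direct estimate of 22.0 m that is 3x less certain
z = fuse_depths([z_geo, 22.0], [math.log(1.0), math.log(3.0)])  # -> 20.5 m
```

With equal log-uncertainties the fusion reduces to a plain average; as one branch's predicted sigma grows, its contribution smoothly shrinks, which is the behavior the abstract's dynamic weighting is meant to capture.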
Journal Introduction:
The impact of computers has nowhere been more revolutionary than in electrical engineering. The design, analysis, and operation of electrical and electronic systems are now dominated by computers, a transformation that has been motivated by the natural ease of interface between computers and electrical systems, and the promise of spectacular improvements in speed and efficiency.
Published since 1973, Computers & Electrical Engineering provides rapid publication of topical research into the integration of computer technology and computational techniques with electrical and electronic systems. The journal publishes papers featuring novel implementations of computers and computational techniques in areas like signal and image processing, high-performance computing, parallel processing, and communications. Special attention will be paid to papers describing innovative architectures, algorithms, and software tools.