Title: Toward Effective 3D Object Detection via Multimodal Fusion to Automatic Driving for Industrial Cyber-Physical Systems
Authors: Honghao Gao; Yan Sun; Junsheng Xiao; Danqing Fang; Yueshen Xu; Wei Wei
DOI: 10.1109/TICPS.2024.3427060
Journal: IEEE Transactions on Industrial Cyber-Physical Systems, vol. 2, pp. 281-291
Published: 2024-07-12
URL: https://ieeexplore.ieee.org/document/10596943/
Citations: 0
Abstract
AI-empowered automatic driving has developed rapidly in industrial cyber-physical systems (CPSs), especially in vehicle safety and driverless technologies. 3D object detection is an important task for perceiving the surrounding environment and supporting decision-making when vehicles are on the road, and it is also a focus of CPS research. Light detection and ranging (LiDAR)-based detection methods usually lack semantic information, resulting in high uncertainty and incorrect outputs, which makes complex road scenes difficult to handle. Data fusion-based methods have been developed to address these issues; however, spatiotemporal misalignment between different sensors tends to cause information loss during fusion. This paper proposes exploiting multimodal information to learn higher-level features, thereby reducing the uncertainty of 3D object detection. First, the VxMLA (voxel and multilevel attention) framework is employed to improve point cloud identification and modeling during 3D object detection. Second, the MF-CAMRL (modal fusion-based channel attention and multidimensional regression loss) model is proposed with two subnetworks. Our model encompasses two strategies: a multimodal fusion strategy and a deep learning model based on CAMRL. The former exploits semantic complementarity and geometric proximity for decision-level fusion. The latter uses weighted ensembles of bounding boxes to fully exploit the high-level decision information derived from both modalities and to reduce the information loss incurred during modal fusion. Finally, extensive experiments are performed on the KITTI dataset. The results show that our method outperforms the baseline methods.
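The abstract's second strategy, weighted ensembling of bounding boxes from two modalities, can be illustrated with a minimal sketch. This is not the paper's actual MF-CAMRL implementation; it is a generic confidence-weighted box fusion for axis-aligned 2D boxes, in which detections from both detectors (e.g., LiDAR and camera branches) are pooled, clustered by IoU overlap, and each cluster is merged into a score-weighted average box. All names and thresholds here are illustrative assumptions:

```python
import numpy as np

def fuse_boxes(boxes, scores, iou_thresh=0.55):
    """Confidence-weighted fusion of overlapping boxes from two detectors.

    boxes:  (N, 4) array of [x1, y1, x2, y2], pooled from both modalities.
    scores: (N,) confidence of each box.
    Returns fused boxes (score-weighted means) and per-cluster mean scores.
    """
    def iou(a, b):
        # Intersection-over-union of two axis-aligned boxes.
        x1, y1 = np.maximum(a[:2], b[:2])
        x2, y2 = np.minimum(a[2:], b[2:])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    # Process high-confidence boxes first so clusters seed from strong detections.
    order = np.argsort(scores)[::-1]
    boxes, scores = boxes[order], scores[order]

    clusters = []  # per cluster: (weighted box sum, score sum, box count)
    fused = []     # current representative box per cluster

    for b, s in zip(boxes, scores):
        for i in range(len(fused)):
            if iou(b, fused[i]) >= iou_thresh:
                bw, sw, n = clusters[i]
                clusters[i] = (bw + s * b, sw + s, n + 1)
                fused[i] = clusters[i][0] / clusters[i][1]  # score-weighted mean
                break
        else:
            clusters.append((s * b, s, 1))
            fused.append(b.copy())

    out_boxes = np.array([bw / sw for bw, sw, _ in clusters])
    out_scores = np.array([sw / n for _, sw, n in clusters])
    return out_boxes, out_scores
```

Unlike non-maximum suppression, which discards overlapping boxes, this kind of ensembling averages their coordinates so that decision information from both modalities contributes to the final box, which matches the abstract's stated goal of reducing information loss during fusion.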