CL-fusionBEV：鸟瞰图中相机-激光雷达融合的 3D 物体探测方法

IF 5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Complex & Intelligent Systems Pub Date : 2024-07-27 DOI:10.1007/s40747-024-01567-0

Peicheng Shi, Zhiqiang Liu, Xinlong Dong, Aixi Yang

{"title":"CL-fusionBEV：鸟瞰图中相机-激光雷达融合的 3D 物体探测方法","authors":"Peicheng Shi, Zhiqiang Liu, Xinlong Dong, Aixi Yang","doi":"10.1007/s40747-024-01567-0","DOIUrl":null,"url":null,"abstract":"<p>In the wave of research on autonomous driving, 3D object detection from the Bird’s Eye View (BEV) perspective has emerged as a pivotal area of focus. The essence of this challenge is the effective fusion of camera and LiDAR data into the BEV. Current approaches predominantly train and predict within the front view and Cartesian coordinate system, often overlooking the inherent structural and operational differences between cameras and LiDAR sensors. This paper introduces CL-FusionBEV, an innovative 3D object detection methodology tailored for sensor data fusion in the BEV perspective. Our approach initiates with a view transformation, facilitated by an implicit learning module that transitions the camera’s perspective to the BEV space, thereby aligning the prediction module. Subsequently, to achieve modal fusion within the BEV framework, we employ voxelization to convert the LiDAR point cloud into BEV space, thereby generating LiDAR BEV spatial features. Moreover, to integrate the BEV spatial features from both camera and LiDAR, we have developed a multi-modal cross-attention mechanism and an implicit multi-modal fusion network, designed to enhance the synergy and application of dual-modal data. To counteract potential deficiencies in global reasoning and feature interaction arising from multi-modal cross-attention, we propose a BEV self-attention mechanism that facilitates comprehensive global feature operations. Our methodology has undergone rigorous evaluation on a substantial dataset within the autonomous driving domain, the nuScenes dataset. The outcomes demonstrate that our method achieves a mean Average Precision (mAP) of 73.3% and a nuScenes Detection Score (NDS) of 75.5%, particularly excelling in the detection of cars and pedestrians with high accuracies of 89% and 90.7%, respectively. Additionally, CL-FusionBEV exhibits superior performance in identifying occluded and distant objects, surpassing existing comparative methods.</p>","PeriodicalId":10524,"journal":{"name":"Complex & Intelligent Systems","volume":null,"pages":null},"PeriodicalIF":5.0000,"publicationDate":"2024-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CL-fusionBEV: 3D object detection method with camera-LiDAR fusion in Bird’s Eye View\",\"authors\":\"Peicheng Shi, Zhiqiang Liu, Xinlong Dong, Aixi Yang\",\"doi\":\"10.1007/s40747-024-01567-0\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>In the wave of research on autonomous driving, 3D object detection from the Bird’s Eye View (BEV) perspective has emerged as a pivotal area of focus. The essence of this challenge is the effective fusion of camera and LiDAR data into the BEV. Current approaches predominantly train and predict within the front view and Cartesian coordinate system, often overlooking the inherent structural and operational differences between cameras and LiDAR sensors. This paper introduces CL-FusionBEV, an innovative 3D object detection methodology tailored for sensor data fusion in the BEV perspective. Our approach initiates with a view transformation, facilitated by an implicit learning module that transitions the camera’s perspective to the BEV space, thereby aligning the prediction module. Subsequently, to achieve modal fusion within the BEV framework, we employ voxelization to convert the LiDAR point cloud into BEV space, thereby generating LiDAR BEV spatial features. Moreover, to integrate the BEV spatial features from both camera and LiDAR, we have developed a multi-modal cross-attention mechanism and an implicit multi-modal fusion network, designed to enhance the synergy and application of dual-modal data. To counteract potential deficiencies in global reasoning and feature interaction arising from multi-modal cross-attention, we propose a BEV self-attention mechanism that facilitates comprehensive global feature operations. Our methodology has undergone rigorous evaluation on a substantial dataset within the autonomous driving domain, the nuScenes dataset. The outcomes demonstrate that our method achieves a mean Average Precision (mAP) of 73.3% and a nuScenes Detection Score (NDS) of 75.5%, particularly excelling in the detection of cars and pedestrians with high accuracies of 89% and 90.7%, respectively. Additionally, CL-FusionBEV exhibits superior performance in identifying occluded and distant objects, surpassing existing comparative methods.</p>\",\"PeriodicalId\":10524,\"journal\":{\"name\":\"Complex & Intelligent Systems\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":5.0000,\"publicationDate\":\"2024-07-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Complex & Intelligent Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s40747-024-01567-0\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Complex & Intelligent Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s40747-024-01567-0","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

在自动驾驶研究的浪潮中，从鸟瞰（BEV）角度进行三维物体检测已成为一个关键的重点领域。这一挑战的本质是将摄像头和激光雷达数据有效融合到 BEV 中。目前的方法主要是在前视角和笛卡尔坐标系下进行训练和预测，往往忽略了相机和激光雷达传感器之间固有的结构和操作差异。本文介绍了 CL-FusionBEV，这是一种创新的三维物体检测方法，专门针对 BEV 视角下的传感器数据融合而定制。我们的方法从视图转换开始，通过隐式学习模块将相机视角转换到 BEV 空间，从而对齐预测模块。随后，为了在 BEV 框架内实现模态融合，我们采用体素化技术将激光雷达点云转换到 BEV 空间，从而生成激光雷达 BEV 空间特征。此外，为了整合来自相机和激光雷达的 BEV 空间特征，我们还开发了多模态交叉注意机制和隐式多模态融合网络，旨在增强双模态数据的协同作用和应用。为了克服多模态交叉关注在全局推理和特征交互方面可能存在的缺陷，我们提出了一种 BEV 自关注机制，以促进全面的全局特征操作。我们的方法在自动驾驶领域的大量数据集 nuScenes 数据集上进行了严格评估。结果表明，我们的方法实现了 73.3% 的平均精度（mAP）和 75.5% 的 nuScenes 检测得分（NDS），尤其在汽车和行人检测方面表现出色，准确率分别高达 89% 和 90.7%。此外，CL-FusionBEV 在识别遮挡物体和远处物体方面表现出色，超过了现有的比较方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

CL-fusionBEV: 3D object detection method with camera-LiDAR fusion in Bird’s Eye View

查看原文本刊更多论文

CL-fusionBEV: 3D object detection method with camera-LiDAR fusion in Bird’s Eye View

In the wave of research on autonomous driving, 3D object detection from the Bird’s Eye View (BEV) perspective has emerged as a pivotal area of focus. The essence of this challenge is the effective fusion of camera and LiDAR data into the BEV. Current approaches predominantly train and predict within the front view and Cartesian coordinate system, often overlooking the inherent structural and operational differences between cameras and LiDAR sensors. This paper introduces CL-FusionBEV, an innovative 3D object detection methodology tailored for sensor data fusion in the BEV perspective. Our approach initiates with a view transformation, facilitated by an implicit learning module that transitions the camera’s perspective to the BEV space, thereby aligning the prediction module. Subsequently, to achieve modal fusion within the BEV framework, we employ voxelization to convert the LiDAR point cloud into BEV space, thereby generating LiDAR BEV spatial features. Moreover, to integrate the BEV spatial features from both camera and LiDAR, we have developed a multi-modal cross-attention mechanism and an implicit multi-modal fusion network, designed to enhance the synergy and application of dual-modal data. To counteract potential deficiencies in global reasoning and feature interaction arising from multi-modal cross-attention, we propose a BEV self-attention mechanism that facilitates comprehensive global feature operations. Our methodology has undergone rigorous evaluation on a substantial dataset within the autonomous driving domain, the nuScenes dataset. The outcomes demonstrate that our method achieves a mean Average Precision (mAP) of 73.3% and a nuScenes Detection Score (NDS) of 75.5%, particularly excelling in the detection of cars and pedestrians with high accuracies of 89% and 90.7%, respectively. Additionally, CL-FusionBEV exhibits superior performance in identifying occluded and distant objects, surpassing existing comparative methods.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Complex & Intelligent Systems COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE-

CiteScore

9.60

自引率

10.30%

发文量

297

期刊介绍： Complex & Intelligent Systems aims to provide a forum for presenting and discussing novel approaches, tools and techniques meant for attaining a cross-fertilization between the broad fields of complex systems, computational simulation, and intelligent analytics and visualization. The transdisciplinary research that the journal focuses on will expand the boundaries of our understanding by investigating the principles and processes that underlie many of the most profound problems facing society today.