MGAF: LiDAR-Camera 3D Object Detection with Multiple Guidance and Adaptive Fusion.

IF 18.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Pattern Analysis and Machine Intelligence Pub Date : 2025-09-22 DOI:10.1109/tpami.2025.3612958

Baojie Fan, Xiaotian Li, Yuhan Zhou, Caixia Xia, Huijie Fan, Fengyu Xu, Jiandong Tian


Citations: 0

Abstract

Recent years have witnessed remarkable progress in multi-modality 3D object detection methods based on the Bird's-Eye-View (BEV) perspective. However, most of them overlook the complementary interaction and guidance between LiDAR and camera. In this work, we propose MGAF, a novel multi-modality 3D object detection method with multi-guided global interaction and LiDAR-guided adaptive fusion. Specifically, we introduce sparse depth guidance (SDG) and LiDAR occupancy guidance (LOG) to generate 3D features with sufficient depth and spatial information. A dedicated semantic segmentation network captures category and orientation priors for raw point clouds. Next, an Adaptive Fusion Dual Transformer (AFDT) is developed to adaptively enhance the interaction between the BEV features of the different modalities from both global and bidirectional perspectives. Meanwhile, additional downsampling with sparse height compression and a multi-scale dual-path transformer (MSDPT) are designed to enlarge the receptive fields of the features of each modality. Finally, a temporal fusion module is introduced to aggregate features from previous frames. Notably, the proposed AFDT is general and also shows superior performance when applied to other models. Extensive experiments on the large-scale nuScenes dataset, the Waymo Open Dataset, and the long-range Argoverse2 dataset consistently demonstrate state-of-the-art performance. The code will be released at: https://github.com/xioatian1/MGAF.

Keywords: 3D object detection, multi-modality, multiple guidance, adaptive fusion, BEV representation, autonomous driving.
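
The abstract does not say how sparse depth guidance (SDG) is computed. As background only, one common way to obtain sparse depth for a camera image is to project LiDAR points into the image plane and keep the nearest depth per pixel. The sketch below illustrates that generic step and is not the authors' implementation. The function name `sparse_depth_map`, the pinhole intrinsics `fx`, `fy`, `cx`, `cy`, and the assumption that the points are already expressed in the camera frame are all hypothetical choices made for illustration.

```python
import math
from typing import Dict, Iterable, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in the camera frame, z pointing forward


def sparse_depth_map(
    points_cam: Iterable[Point],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    width: int,
    height: int,
) -> Dict[Tuple[int, int], float]:
    """Project camera-frame points with a pinhole model.

    Returns a sparse map {(u, v): depth} holding the nearest depth per pixel.
    Hypothetical helper for illustration; not taken from the MGAF code.
    """
    depth: Dict[Tuple[int, int], float] = {}
    for x, y, z in points_cam:
        if z <= 0.0:
            continue  # point lies behind the camera
        u = math.floor(fx * x / z + cx)
        v = math.floor(fy * y / z + cy)
        if not (0 <= u < width and 0 <= v < height):
            continue  # projection falls outside the image
        key = (u, v)
        if key not in depth or z < depth[key]:
            depth[key] = z  # keep the closest surface at this pixel
    return depth


if __name__ == "__main__":
    pts = [(0.0, 0.0, 10.0), (0.0, 0.0, 5.0), (1.0, 0.0, 10.0)]
    print(sparse_depth_map(pts, 100.0, 100.0, 50.0, 50.0, 100, 100))
```

Most pixels receive no LiDAR return, so the map is stored as a dictionary rather than a dense array. A real pipeline would first transform the points from the LiDAR frame to the camera frame using the dataset's calibration.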


Source journal

IEEE Transactions on Pattern Analysis and Machine Intelligence (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore

28.40

Self-citation rate

3.00%

Articles published

885

Review time

8.5 months

Journal description: The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition, and relevant specialized hardware and/or software architectures are also covered.