Height-Adaptive Deformable Multi-Modal Fusion for 3D Object Detection

IF 3.4 | Zone 3 (Computer Science) | JCR Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Access | Published 2025-03-21 | DOI: 10.1109/ACCESS.2025.3553372

Jiahao Li, Lingshan Chen, Zhen Li

IEEE Access, vol. 13, pp. 52385-52396

Citations: 0

Abstract

LiDAR-camera fusion has demonstrated remarkable potential in 3D object detection for autonomous vehicles by leveraging complementary information from both modalities. Recent state-of-the-art approaches rely primarily on projection matrices to achieve cross-modal data alignment. However, these methods often perform poorly under sensor misalignment or calibration errors, resulting in suboptimal fusion quality and limited robustness. In this paper, we propose a novel 3D object detection framework, called Height-Adaptive Deformable Multi-Modal Fusion, which leverages Deformable Attention to enhance the fusion process. Specifically, we introduce a Deformable-based Cross-Modal Spatial Attention that dynamically fuses image features through learnable offsets, allowing more flexible and precise alignment between the LiDAR and camera modalities. To further improve fusion quality, we design a Height-Adaptive Aggregation strategy that mitigates the risk of incorrectly fusing features from background points while emphasizing the aggregation of foreground object features. In addition, we introduce projection noise to simulate misalignment scenarios and add an extra supervision loss to counter the resulting misalignment. Extensive experiments on the nuScenes benchmark demonstrate the effectiveness and robustness of the proposed framework. Specifically, our method significantly outperforms the LiDAR-only method and exhibits smaller precision degradation under sensor misalignment, outperforming other fusion-based approaches. Our results validate the potential of the proposed framework for improving 3D object detection accuracy, particularly in real-world environments with imperfect sensors.
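The abstract contrasts fixed projection-matrix alignment with deformable sampling through learnable offsets. The following is a minimal NumPy sketch, not the authors' implementation, of a generic single-head, single-scale deformable attention step (in the style of Deformable DETR) in which a LiDAR point is projected into an image feature map and image features are gathered around that reference point. All function names, the toy camera matrices, the random matrices standing in for learned layers, and the rotation-plus-translation noise model used to mimic calibration error are illustrative assumptions; the abstract does not specify the paper's noise model, its Height-Adaptive Aggregation, or its extra supervision loss, so the sketch covers only projection and deformable sampling.

```python
import numpy as np


def project_points(points_xyz, extrinsic, intrinsic):
    """Project LiDAR points (N, 3) to pixel coordinates (N, 2).

    extrinsic: 4x4 LiDAR-to-camera transform; intrinsic: 3x3 camera matrix.
    Returns (x, y) pixel coordinates and camera-frame depth per point.
    """
    homo = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])  # (N, 4)
    cam = (extrinsic @ homo.T).T[:, :3]                             # camera frame
    pix = (intrinsic @ cam.T).T                                     # (N, 3)
    return pix[:, :2] / pix[:, 2:3], cam[:, 2]


def perturb_extrinsic(extrinsic, rng, max_rot_deg=1.0, max_shift_m=0.05):
    """Hypothetical misalignment model: a random rotation about the LiDAR
    z-axis plus a random translation, applied before the nominal extrinsic."""
    angle = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
    c, s = np.cos(angle), np.sin(angle)
    noise = np.eye(4)
    noise[:2, :2] = [[c, -s], [s, c]]
    noise[:3, 3] = rng.uniform(-max_shift_m, max_shift_m, size=3)
    return extrinsic @ noise


def bilinear_sample(feat, x, y):
    """Sample feat (H, W, C) at continuous pixel (x, y); zeros outside the map."""
    H, W, C = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    out = np.zeros(C)
    for dy in (0, 1):
        for dx in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            if 0 <= xi < W and 0 <= yi < H:
                w = (1 - abs(x - xi)) * (1 - abs(y - yi))
                out += w * feat[yi, xi]
    return out


def deformable_attend(query, ref_xy, img_feat, w_offset, w_attn, w_value):
    """Single-head deformable attention for one LiDAR query.

    query: (D,) point feature; ref_xy: (2,) projected reference pixel.
    w_offset (D, 2K) predicts K (dx, dy) offsets, w_attn (D, K) predicts
    attention logits, w_value (C, D) maps image channels to the query dim.
    """
    K = w_attn.shape[1]
    offsets = (query @ w_offset).reshape(K, 2)         # learnable offsets
    logits = query @ w_attn
    attn = np.exp(logits - logits.max())
    attn /= attn.sum()                                 # softmax over K samples
    sampled = np.stack([bilinear_sample(img_feat, *(ref_xy + off))
                        for off in offsets])           # (K, C)
    return attn @ (sampled @ w_value)                  # (D,)


# Toy usage: LiDAR and camera frames coincide (z forward) for simplicity.
rng = np.random.default_rng(0)
K_cam = np.array([[500.0, 0.0, 32.0], [0.0, 500.0, 24.0], [0.0, 0.0, 1.0]])
T_lidar2cam = np.eye(4)
points = np.array([[0.5, 0.2, 10.0], [-1.0, 0.1, 20.0]])
img_feat = rng.standard_normal((48, 64, 8))        # (H, W, C) image features
D, C, K = 16, 8, 4                                 # query dim, channels, samples
w_offset = 0.5 * rng.standard_normal((D, 2 * K))   # stand-ins for learned weights
w_attn = rng.standard_normal((D, K))
w_value = rng.standard_normal((C, D))
queries = rng.standard_normal((len(points), D))    # per-point LiDAR features

noisy_T = perturb_extrinsic(T_lidar2cam, rng)
uv, depth = project_points(points, noisy_T, K_cam)
fused = np.stack([deformable_attend(q, p, img_feat, w_offset, w_attn, w_value)
                  for q, p in zip(queries, uv)])
print(fused.shape)  # (2, 16)
```

Because the offsets are predicted from each query instead of being fixed by the calibration, the sampling locations can in principle move to compensate for a perturbed projection. In a trained model, `w_offset`, `w_attn`, and `w_value` would be learned parameters rather than random draws.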

Source journal

IEEE Access: COMPUTER SCIENCE, INFORMATION SYSTEMS; ENGINEERING, ELECTRICAL & ELECTRONIC

CiteScore: 9.80

Self-citation rate: 7.70%

Articles published: 6673

Review time: 6 weeks

Journal description: IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest. IEEE Access will publish articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary", in that reviewers will either Accept or Reject an article in the form it is submitted in order to achieve rapid turnaround. Especially encouraged are submissions on: Multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals. Practical articles discussing new experiments or measurement techniques, interesting solutions to engineering. Development of new or improved fabrication or manufacturing techniques. Reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.