{"title":"Fusion4DAL: Offline Multi-modal 3D Object Detection for 4D Auto-labeling","authors":"Zhiyuan Yang, Xuekuan Wang, Wei Zhang, Xiao Tan, Jincheng Lu, Jingdong Wang, Errui Ding, Cairong Zhao","doi":"10.1007/s11263-025-02370-1","DOIUrl":null,"url":null,"abstract":"<p>Integrating LiDAR and camera information has been a widely adopted approach for 3D object detection in autonomous driving. Nevertheless, the unexplored potential of multi-modal fusion remains in the realm of offline 4D detection. We experimentally find that the root lies in two reasons: (1) the sparsity of point clouds poses a challenge in extracting long-term image features and thereby results in information loss. (2) some of the LiDAR points may be obstructed in the image, leading to incorrect image features. To tackle these problems, we first propose a simple yet effective offline multi-modal 3D object detection method, named Fusion4DAL, for 4D auto-labeling with long-term multi-modal sequences. Specifically, in order to address the sparsity of points within objects, we propose a multi-modal mixed feature fusion module (MMFF). In the MMFF module, we introduce virtual points based on a dense 3D grid and combine them with real LiDAR points. The mixed points are then utilized to extract dense point-level image features, thereby enhancing multi-modal feature fusion without being constrained by the sparse real LiDAR points. As to the obstructed LiDAR points, we leverage the occlusion relationship among objects to ensure depth consistency between LiDAR points and their corresponding depth feature maps, thus filtering out erroneous image features. In addition, we define a virtual point loss (VP Loss) to distinguish different types of mixed points and preserve the geometric shape of objects. Furthermore, in order to promote long-term receptive field and capture finer-grained features, we propose a global point attention decoder with a box-level self-attention module and a global point attention module. 
Finally, comprehensive experiments show that Fusion4DAL outperforms state-of-the-art offline 3D detection methods on nuScenes and Waymo dataset.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":11.6000,"publicationDate":"2025-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Computer Vision","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11263-025-02370-1","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Integrating LiDAR and camera information has been a widely adopted approach for 3D object detection in autonomous driving. Nevertheless, the potential of multi-modal fusion remains largely unexplored in the realm of offline 4D detection. We experimentally find that the root cause lies in two factors: (1) the sparsity of point clouds poses a challenge for extracting long-term image features and thereby results in information loss; (2) some LiDAR points may be occluded in the image, leading to incorrect image features. To tackle these problems, we propose a simple yet effective offline multi-modal 3D object detection method, named Fusion4DAL, for 4D auto-labeling with long-term multi-modal sequences. Specifically, to address the sparsity of points within objects, we propose a multi-modal mixed feature fusion module (MMFF). In the MMFF module, we introduce virtual points based on a dense 3D grid and combine them with real LiDAR points. The mixed points are then used to extract dense point-level image features, thereby enhancing multi-modal feature fusion without being constrained by the sparse real LiDAR points. As for the occluded LiDAR points, we leverage the occlusion relationships among objects to enforce depth consistency between LiDAR points and their corresponding depth feature maps, thus filtering out erroneous image features. In addition, we define a virtual point loss (VP Loss) to distinguish different types of mixed points and preserve the geometric shape of objects. Furthermore, to enlarge the long-term receptive field and capture finer-grained features, we propose a global point attention decoder with a box-level self-attention module and a global point attention module. Finally, comprehensive experiments show that Fusion4DAL outperforms state-of-the-art offline 3D detection methods on the nuScenes and Waymo datasets.
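The two ideas sketched in the abstract — mixing dense virtual grid points with sparse real LiDAR points (MMFF), and rejecting image features whose projected depth disagrees with a depth map (occlusion filtering) — can be illustrated roughly as below. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the axis-aligned (unrotated) box parameterization, the grid resolution, and the depth tolerance are all illustrative choices.

```python
import numpy as np

def make_mixed_points(lidar_points, box_center, box_size, grid_res=4):
    """Build a dense grid of virtual points inside a 3D box and mix them
    with the real LiDAR points falling in that box (assumed axis-aligned)."""
    # One linspace per axis, scaled to the box extent and shifted to its center.
    axes = [np.linspace(-0.5, 0.5, grid_res) * s + c
            for s, c in zip(box_size, box_center)]
    gx, gy, gz = np.meshgrid(*axes, indexing="ij")
    virtual = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)
    mixed = np.concatenate([lidar_points, virtual], axis=0)
    # Flag distinguishing point types, as VP Loss would need: 1 = real, 0 = virtual.
    is_real = np.concatenate([np.ones(len(lidar_points)),
                              np.zeros(len(virtual))])
    return mixed, is_real

def filter_by_depth_consistency(points_cam, depth_map, K, tol=1.0):
    """Keep only points (in camera coordinates) whose depth agrees with the
    depth map at their image projection; occluded points fail the check."""
    uv = (K @ points_cam.T).T          # pinhole projection with intrinsics K
    z = uv[:, 2]
    u = np.round(uv[:, 0] / z).astype(int)
    v = np.round(uv[:, 1] / z).astype(int)
    h, w = depth_map.shape
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (z > 0)
    keep = np.zeros(len(points_cam), dtype=bool)
    idx = np.where(inside)[0]
    keep[idx] = np.abs(depth_map[v[idx], u[idx]] - z[idx]) < tol
    return keep
```

With `grid_res=4` each box contributes 64 virtual points regardless of how few LiDAR returns it received, which is the point of the mixing: image features are sampled at the mixed set, so feature density no longer depends on LiDAR sparsity.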
Journal description:
The International Journal of Computer Vision (IJCV) serves as a platform for sharing new research findings in the rapidly growing field of computer vision. It publishes 12 issues annually and presents high-quality, original contributions to the science and engineering of computer vision. The journal encompasses various types of articles to cater to different research outputs.
Regular articles, which span up to 25 journal pages, focus on significant technical advancements that are of broad interest to the field. These articles showcase substantial progress in computer vision.
Short articles, limited to 10 pages, offer a swift publication path for sharing novel research outcomes with the computer vision community.
Survey articles, comprising up to 30 pages, provide critical evaluations of the current state of the art in computer vision or tutorial presentations of relevant topics, giving comprehensive and insightful overviews of specific subject areas.
In addition to technical articles, the journal also includes book reviews, position papers, and editorials by prominent scientific figures. These contributions serve to complement the technical content and provide valuable perspectives.
The journal encourages authors to include supplementary material online, such as images, video sequences, data sets, and software. This additional material enhances the understanding and reproducibility of the published research.
Overall, the International Journal of Computer Vision is a comprehensive publication for researchers in the field: it covers a range of article types, offers supplementary online resources, and facilitates the dissemination of impactful research.