VIDF-Net: A Voxel-Image Dynamic Fusion method for 3D object detection

IF 4.3 | Zone 3, Computer Science | JCR Q2: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding | Pub Date: 2024-09-07 | DOI: 10.1016/j.cviu.2024.104164

Xuezhi Xiang, Dianang Li, Xi Wang, Xiankun Zhou, Yulong Qiao

Computer Vision and Image Understanding, Volume 249, Article 104164. Journal Article. https://www.sciencedirect.com/science/article/pii/S1077314224002455

Citations: 0

Abstract

In recent years, multi-modal fusion methods have shown excellent performance in 3D object detection. These methods typically select voxel centers and fuse them globally with image features across the scene. However, such approaches have two issues. First, the distribution of voxel density is highly heterogeneous because of the discrete volumes. Second, the features of images and point clouds differ significantly, and global fusion does not take into account the correspondence between the two modalities, which leads to insufficient fusion. In this paper, we propose a new multi-modal fusion method named Voxel-Image Dynamic Fusion (VIDF). Specifically, VIDF-Net is composed of the Voxel Centroid Mapping (VCM) module and the Deformable Attention Fusion (DAF) module. The VCM module calculates the centroids of voxel features and maps them onto the image plane, which locates voxel features more effectively. The DAF module then dynamically calculates the offset of each voxel centroid from the image position and combines the two modalities. Furthermore, we propose a Region Proposal Network with Channel-Spatial Aggregate, which combines channel and spatial attention maps for improved multi-scale feature interaction. We conduct extensive experiments on the KITTI dataset to demonstrate the performance of the proposed VIDF network. In particular, significant improvements are observed in the Hard categories of Cars and Pedestrians, which shows the effectiveness of our approach in complex scenarios.
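The abstract gives no implementation details, so the following is only a minimal sketch of the two ideas it names: taking the centroid of the points in each voxel (rather than the voxel's geometric center) and projecting it onto the image plane, then sampling image features around that position with offsets. All names (`voxel_centroids`, `project_to_image`, `bilinear_sample`, `deformable_sample`, `voxel_size`, `proj`) are hypothetical. The sketch assumes the points are already expressed in the camera frame, that `proj` is a 3x4 pinhole projection matrix, and that the offsets and weights are supplied by the caller; in a real deformable-attention module they would be predicted by learned layers, which are not modeled here.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
VoxelIndex = Tuple[int, int, int]


def voxel_centroids(points: Sequence[Point], voxel_size: float) -> Dict[VoxelIndex, Point]:
    """Group points by voxel index and return the mean point of each non-empty voxel."""
    # Each accumulator holds [sum_x, sum_y, sum_z, count].
    sums: Dict[VoxelIndex, List[float]] = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        acc = sums[key]
        acc[0] += x
        acc[1] += y
        acc[2] += z
        acc[3] += 1.0
    return {k: (s[0] / s[3], s[1] / s[3], s[2] / s[3]) for k, s in sums.items()}


def project_to_image(point: Point, proj: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Project a 3D point to pixel coordinates (u, v) with a 3x4 matrix (assumed pinhole model)."""
    h = (point[0], point[1], point[2], 1.0)  # homogeneous coordinates
    u, v, w = (sum(row[i] * h[i] for i in range(4)) for row in proj)
    if w <= 0.0:
        raise ValueError("point is behind the camera")
    return u / w, v / w


def bilinear_sample(feat: Sequence[Sequence[float]], u: float, v: float) -> float:
    """Bilinearly interpolate a single-channel feature map (rows = v, cols = u); zero outside."""
    height, width = len(feat), len(feat[0])
    u0, v0 = math.floor(u), math.floor(v)
    du, dv = u - u0, v - v0
    total = 0.0
    for vi, wv in ((v0, 1.0 - dv), (v0 + 1, dv)):
        for ui, wu in ((u0, 1.0 - du), (u0 + 1, du)):
            if 0 <= vi < height and 0 <= ui < width:
                total += wv * wu * feat[vi][ui]
    return total


def deformable_sample(
    feat: Sequence[Sequence[float]],
    ref: Tuple[float, float],
    offsets: Sequence[Tuple[float, float]],
    weights: Sequence[float],
) -> float:
    """Weighted sum of features sampled at ref + offset (offsets and weights given, not learned)."""
    return sum(
        wt * bilinear_sample(feat, ref[0] + du, ref[1] + dv)
        for (du, dv), wt in zip(offsets, weights)
    )
```

In this sketch, a voxel holding two points near one corner yields a centroid near that corner, so the projected reference position follows the actual point distribution instead of the fixed voxel center. The sampled image feature would then be combined with the voxel feature; how the paper performs that combination is not stated in the abstract.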




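Source journal

Computer Vision and Image Understanding (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 7.80

Self-citation rate: 4.40%

Articles published: 112

Review time: 79 days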

About the journal: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.

Research areas include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems