A Transformer based on Voxel Spatial-Channel Attention for 3D object detection

Impact Factor 3.9, CAS Tier 3 (Computer Science), JCR Q2 (Computer Science, Artificial Intelligence)
Jun Lu, Guangyu Ji, Chengtao Cai, Kaibin Qin
{"title":"A Transformer based on Voxel Spatial-Channel Attention for 3D object detection","authors":"Jun Lu,&nbsp;Guangyu Ji,&nbsp;Chengtao Cai,&nbsp;Kaibin Qin","doi":"10.1016/j.patrec.2025.04.034","DOIUrl":null,"url":null,"abstract":"<div><div>Existing voxel-based object detection methods primarily use convolution or sparse convolution for feature extraction, followed by the classification and regression tasks based on voxel features. However, the coarse representation of point clouds by voxels can limit the ability to capture small object features, and the precision of 3D bounding-box regression may also be compromised, ultimately impacting detection accuracy. To address this issue, we propose a novel voxel-based architecture, the Voxel Spatial-Channel Transformer (VoxSCT), to detect 3D objects from point clouds through point-to-point translation. VoxSCT is constructed based on a Voxel-based Spatial-Channel Attention (VSCA) module. The global and local channel attention modules of VSCA enhance the model’s sensitivity to local feature variations within a voxel, enabling it to distinguish different objects within the same voxel. Additionally, the global and local spatial attention modules of VSCA identify relationships between different parts of an object scattered across multiple voxels, allowing the network to better represent entire objects. By integrating various geometric features, VoxSCT enhances the representation of small objects. Ultimately, it reassigns voxel features to the original points through a cross-attention module, utilizing the original points for classification and regression, thereby improving the precision of 3D bounding-boxes. The proposed VoxSCT combines the accuracy of point-based models with the efficiency of voxel-based models, making it a promising alternative for voxel-based backbones. VoxSCT achieves mAP scores of 78.22% and 70.56% on LEVEL 1 and LEVEL 2 of the vehicle category in the Waymo validation 3D detection benchmark.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"195 ","pages":"Pages 37-43"},"PeriodicalIF":3.9000,"publicationDate":"2025-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition Letters","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S016786552500176X","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Existing voxel-based object detection methods primarily use convolution or sparse convolution for feature extraction, followed by classification and regression based on voxel features. However, the coarse representation of point clouds by voxels can limit the ability to capture small-object features, and the precision of 3D bounding-box regression may also be compromised, ultimately impacting detection accuracy. To address this issue, we propose a novel voxel-based architecture, the Voxel Spatial-Channel Transformer (VoxSCT), which detects 3D objects from point clouds through point-to-point translation. VoxSCT is built on a Voxel-based Spatial-Channel Attention (VSCA) module. The global and local channel attention modules of VSCA enhance the model’s sensitivity to local feature variations within a voxel, enabling it to distinguish different objects within the same voxel. Additionally, the global and local spatial attention modules of VSCA identify relationships between different parts of an object scattered across multiple voxels, allowing the network to better represent entire objects. By integrating various geometric features, VoxSCT enhances the representation of small objects. Finally, it reassigns voxel features to the original points through a cross-attention module and uses those points for classification and regression, thereby improving the precision of 3D bounding boxes. The proposed VoxSCT combines the accuracy of point-based models with the efficiency of voxel-based models, making it a promising alternative for voxel-based backbones. VoxSCT achieves mAP scores of 78.22% and 70.56% on LEVEL 1 and LEVEL 2 of the vehicle category on the Waymo validation 3D detection benchmark.
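To make the described mechanism concrete, below is a minimal PyTorch sketch of the two ideas in the abstract: attention over voxel features that combines channel reweighting with voxel-to-voxel relations, and a cross-attention step that redistributes voxel features back to the original points. All module names, tensor shapes, and hyper-parameters (64 channels, 4 heads, voxel and point counts) are illustrative assumptions, not the authors' implementation; the actual VSCA uses both global and local variants of the channel and spatial attention, which this sketch collapses into a single global form.

# Minimal sketch, assuming a PyTorch-style voxel feature tensor (B, num_voxels, C).
# Not the authors' code: module names and hyper-parameters are hypothetical.
import torch
import torch.nn as nn


class SpatialChannelAttention(nn.Module):
    """Channel gating (reweighting feature channels) followed by self-attention
    across voxels (relating object parts scattered over multiple voxels)."""

    def __init__(self, channels: int = 64, num_heads: int = 4):
        super().__init__()
        self.channel_gate = nn.Sequential(          # squeeze-and-excitation style gate
            nn.Linear(channels, channels // 4), nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels), nn.Sigmoid())
        self.spatial_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, voxel_feats):                 # (B, num_voxels, C)
        gate = self.channel_gate(voxel_feats.mean(dim=1, keepdim=True))
        x = voxel_feats * gate                      # emphasize informative channels
        attn, _ = self.spatial_attn(x, x, x)        # voxel-to-voxel relations
        return self.norm(x + attn)


class VoxelToPointCrossAttention(nn.Module):
    """Queries come from per-point embeddings, keys/values from voxel features,
    so enhanced voxel context is redistributed to the original points."""

    def __init__(self, channels: int = 64, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, point_feats, voxel_feats):    # (B, N_pts, C), (B, N_vox, C)
        out, _ = self.cross_attn(point_feats, voxel_feats, voxel_feats)
        return out                                  # per-point features for cls/reg heads


if __name__ == "__main__":
    voxels = torch.randn(2, 1024, 64)               # 2 scenes, 1024 voxels, 64-dim features
    points = torch.randn(2, 4096, 64)               # 4096 raw-point embeddings per scene
    voxels = SpatialChannelAttention()(voxels)
    print(VoxelToPointCrossAttention()(points, voxels).shape)  # torch.Size([2, 4096, 64])

The design point this sketch illustrates is the last step of the pipeline: classification and regression heads would consume the per-point output of the cross-attention, which is how the method aims to recover bounding-box precision lost in the coarse voxelization.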
Source Journal

Pattern Recognition Letters (Engineering & Technology - Computer Science: Artificial Intelligence)
CiteScore: 12.40
Self-citation rate: 5.90%
Articles published: 287
Review time: 9.1 months

Journal description: Pattern Recognition Letters aims at rapid publication of concise articles of a broad interest in pattern recognition. Subject areas include all the current fields of interest represented by the Technical Committees of the International Association of Pattern Recognition, and other developing themes involving learning and recognition.