A Transformer based on Voxel Spatial-Channel Attention for 3D object detection

Impact Factor 3.9, CAS Tier 3 (Computer Science), JCR Q2 (Computer Science, Artificial Intelligence)
Jun Lu, Guangyu Ji, Chengtao Cai, Kaibin Qin
{"title":"A Transformer based on Voxel Spatial-Channel Attention for 3D object detection","authors":"Jun Lu,&nbsp;Guangyu Ji,&nbsp;Chengtao Cai,&nbsp;Kaibin Qin","doi":"10.1016/j.patrec.2025.04.034","DOIUrl":null,"url":null,"abstract":"<div><div>Existing voxel-based object detection methods primarily use convolution or sparse convolution for feature extraction, followed by the classification and regression tasks based on voxel features. However, the coarse representation of point clouds by voxels can limit the ability to capture small object features, and the precision of 3D bounding-box regression may also be compromised, ultimately impacting detection accuracy. To address this issue, we propose a novel voxel-based architecture, the Voxel Spatial-Channel Transformer (VoxSCT), to detect 3D objects from point clouds through point-to-point translation. VoxSCT is constructed based on a Voxel-based Spatial-Channel Attention (VSCA) module. The global and local channel attention modules of VSCA enhance the model’s sensitivity to local feature variations within a voxel, enabling it to distinguish different objects within the same voxel. Additionally, the global and local spatial attention modules of VSCA identify relationships between different parts of an object scattered across multiple voxels, allowing the network to better represent entire objects. By integrating various geometric features, VoxSCT enhances the representation of small objects. Ultimately, it reassigns voxel features to the original points through a cross-attention module, utilizing the original points for classification and regression, thereby improving the precision of 3D bounding-boxes. The proposed VoxSCT combines the accuracy of point-based models with the efficiency of voxel-based models, making it a promising alternative for voxel-based backbones. VoxSCT achieves mAP scores of 78.22% and 70.56% on LEVEL 1 and LEVEL 2 of the vehicle category in the Waymo validation 3D detection benchmark.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"195 ","pages":"Pages 37-43"},"PeriodicalIF":3.9000,"publicationDate":"2025-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition Letters","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S016786552500176X","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Existing voxel-based object detection methods primarily use convolution or sparse convolution for feature extraction, followed by classification and regression based on voxel features. However, the coarse representation of point clouds by voxels can limit the ability to capture small-object features, and the precision of 3D bounding-box regression may also be compromised, ultimately impacting detection accuracy. To address this issue, we propose a novel voxel-based architecture, the Voxel Spatial-Channel Transformer (VoxSCT), which detects 3D objects from point clouds through point-to-point translation. VoxSCT is built on a Voxel-based Spatial-Channel Attention (VSCA) module. The global and local channel attention modules of VSCA enhance the model’s sensitivity to local feature variations within a voxel, enabling it to distinguish different objects within the same voxel. Additionally, the global and local spatial attention modules of VSCA identify relationships between different parts of an object scattered across multiple voxels, allowing the network to better represent entire objects. By integrating various geometric features, VoxSCT enhances the representation of small objects. Finally, it reassigns voxel features to the original points through a cross-attention module and uses those points for classification and regression, thereby improving the precision of 3D bounding boxes. The proposed VoxSCT combines the accuracy of point-based models with the efficiency of voxel-based models, making it a promising alternative for voxel-based backbones. VoxSCT achieves mAP scores of 78.22% and 70.56% on LEVEL 1 and LEVEL 2 of the vehicle category on the Waymo validation 3D detection benchmark.
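To make the described mechanism concrete, below is a minimal PyTorch sketch of the two ideas in the abstract: attention over voxel features that combines channel reweighting with voxel-to-voxel relations, and a cross-attention step that redistributes voxel features back to the original points. All module names, tensor shapes, and hyper-parameters (64 channels, 4 heads, voxel and point counts) are illustrative assumptions, not the authors' implementation; the actual VSCA uses both global and local variants of the channel and spatial attention, which this sketch collapses into a single global form.

# Minimal sketch, assuming a PyTorch-style voxel feature tensor (B, num_voxels, C).
# Not the authors' code: module names and hyper-parameters are hypothetical.
import torch
import torch.nn as nn


class SpatialChannelAttention(nn.Module):
    """Channel gating (reweighting feature channels) followed by self-attention
    across voxels (relating object parts scattered over multiple voxels)."""

    def __init__(self, channels: int = 64, num_heads: int = 4):
        super().__init__()
        self.channel_gate = nn.Sequential(          # squeeze-and-excitation style gate
            nn.Linear(channels, channels // 4), nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels), nn.Sigmoid())
        self.spatial_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, voxel_feats):                 # (B, num_voxels, C)
        gate = self.channel_gate(voxel_feats.mean(dim=1, keepdim=True))
        x = voxel_feats * gate                      # emphasize informative channels
        attn, _ = self.spatial_attn(x, x, x)        # voxel-to-voxel relations
        return self.norm(x + attn)


class VoxelToPointCrossAttention(nn.Module):
    """Queries come from per-point embeddings, keys/values from voxel features,
    so enhanced voxel context is redistributed to the original points."""

    def __init__(self, channels: int = 64, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, point_feats, voxel_feats):    # (B, N_pts, C), (B, N_vox, C)
        out, _ = self.cross_attn(point_feats, voxel_feats, voxel_feats)
        return out                                  # per-point features for cls/reg heads


if __name__ == "__main__":
    voxels = torch.randn(2, 1024, 64)               # 2 scenes, 1024 voxels, 64-dim features
    points = torch.randn(2, 4096, 64)               # 4096 raw-point embeddings per scene
    voxels = SpatialChannelAttention()(voxels)
    print(VoxelToPointCrossAttention()(points, voxels).shape)  # torch.Size([2, 4096, 64])

The design point this sketch illustrates is the last step of the pipeline: classification and regression heads would consume the per-point output of the cross-attention, which is how the method aims to recover bounding-box precision lost in the coarse voxelization.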
Source Journal

Pattern Recognition Letters (Engineering & Technology - Computer Science: Artificial Intelligence)
CiteScore: 12.40
Self-citation rate: 5.90%
Articles published: 287
Review time: 9.1 months

Journal description: Pattern Recognition Letters aims at rapid publication of concise articles of a broad interest in pattern recognition. Subject areas include all the current fields of interest represented by the Technical Committees of the International Association of Pattern Recognition, and other developing themes involving learning and recognition.