A Transformer based on Voxel Spatial-Channel Attention for 3D object detection
Authors: Jun Lu, Guangyu Ji, Chengtao Cai, Kaibin Qin
Journal: Pattern Recognition Letters, volume 195, pages 37-43 (impact factor 3.9; JCR Q2, Computer Science, Artificial Intelligence)
Published: 2025-05-17
DOI: 10.1016/j.patrec.2025.04.034
URL: https://www.sciencedirect.com/science/article/pii/S016786552500176X
Existing voxel-based object detection methods primarily use convolution or sparse convolution for feature extraction and then perform classification and regression on the voxel features. However, the coarse voxel representation of point clouds can limit the ability to capture the features of small objects and can reduce the precision of 3D bounding-box regression, ultimately lowering detection accuracy. To address this issue, we propose a novel voxel-based architecture, the Voxel Spatial-Channel Transformer (VoxSCT), which detects 3D objects from point clouds through point-to-point translation. VoxSCT is built on a Voxel-based Spatial-Channel Attention (VSCA) module. The global and local channel attention modules of VSCA enhance the model’s sensitivity to local feature variations within a voxel, enabling it to distinguish different objects within the same voxel. The global and local spatial attention modules of VSCA identify relationships between parts of an object scattered across multiple voxels, allowing the network to better represent entire objects. By integrating various geometric features, VoxSCT enhances the representation of small objects. Finally, it reassigns voxel features to the original points through a cross-attention module and uses the original points for classification and regression, thereby improving the precision of 3D bounding boxes. The proposed VoxSCT combines the accuracy of point-based models with the efficiency of voxel-based models, making it a promising alternative for voxel-based backbones. VoxSCT achieves mAP scores of 78.22% and 70.56% on LEVEL 1 and LEVEL 2 of the vehicle category on the Waymo validation 3D detection benchmark.
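The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of reassigning voxel features to the original points through cross-attention. It is not the authors' VoxSCT code. The function names (`voxelize`, `point_voxel_cross_attention`), the mean-pooled voxel features, the single attention head, and the random projection matrices standing in for learned weights are all assumptions made for illustration.

```python
import numpy as np


def voxelize(points, voxel_size):
    """Group points into voxels and mean-pool their coordinates (illustrative only)."""
    coords = np.floor(points / voxel_size).astype(np.int64)
    # Unique occupied voxels; `inverse` maps each point to its voxel row.
    voxel_coords, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    counts = np.bincount(inverse, minlength=len(voxel_coords))
    voxel_feats = np.zeros((len(voxel_coords), points.shape[1]))
    np.add.at(voxel_feats, inverse, points)  # sum the points falling in each voxel
    voxel_feats /= counts[:, None]           # convert sums to means
    return voxel_feats, voxel_coords, inverse


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def point_voxel_cross_attention(point_feats, voxel_feats, w_q, w_k, w_v):
    """Single-head cross-attention: points are queries, voxels are keys and values."""
    q = point_feats @ w_q                     # (N, d)
    k = voxel_feats @ w_k                     # (M, d)
    v = voxel_feats @ w_v                     # (M, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (N, M) scaled dot products
    attn = softmax(scores, axis=-1)           # each point's weights over voxels
    return attn @ v, attn                     # (N, d) per-point features


rng = np.random.default_rng(0)
points = rng.uniform(0.0, 4.0, size=(200, 3))  # toy point cloud in a 4 m cube
voxel_feats, voxel_coords, point_to_voxel = voxelize(points, voxel_size=1.0)

d = 8  # hypothetical embedding width
# Random matrices stand in for learned projection weights.
w_q, w_k, w_v = (rng.normal(size=(3, d)) for _ in range(3))
point_out, attn = point_voxel_cross_attention(points, voxel_feats, w_q, w_k, w_v)
```

In this sketch every point attends to every occupied voxel, so the output has one feature vector per original point and a downstream head could classify and regress at point resolution rather than voxel resolution. A practical model would likely restrict attention to nearby voxels to keep the cost manageable.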
About the journal:
Pattern Recognition Letters aims at rapid publication of concise articles of broad interest in pattern recognition.
Subject areas include all the current fields of interest represented by the Technical Committees of the International Association for Pattern Recognition, and other developing themes involving learning and recognition.