VoxT-GNN: A 3D object detection approach from point cloud based on voxel-level transformer and graph neural network

IF 7.4 · Zone 1 (Management Science) · Q1 · COMPUTER SCIENCE, INFORMATION SYSTEMS

Information Processing & Management · Pub Date: 2025-03-27 · DOI: 10.1016/j.ipm.2025.104155

Qiangwen Zheng, Sheng Wu, Jinghui Wei

Information Processing & Management, Volume 62, Issue 4, Article 104155. Full text: https://www.sciencedirect.com/science/article/pii/S0306457325000962

Citations: 0

Abstract

Recently, a variety of LiDAR-based 3D detection methods have exhibited competitive performance on single-class objects, large objects, or straightforward scenes. However, their detection performance in complex scenarios with multi-sized and multi-class objects is limited. We observe that the core problem behind this limitation is insufficient feature learning for small objects in point clouds, which makes it difficult to obtain discriminative features. To address this challenge, we propose VoxT-GNN, a point-cloud-based 3D object detection framework that explicitly accounts for small objects. The framework comprises two core components: a Voxel-Level Transformer (VoxelFormer) for local feature learning and a Graph Neural Network Feed-Forward Network (GnnFFN) for global feature learning. By embedding GnnFFN as an intermediate layer between the encoder and decoder of VoxelFormer, we achieve flexible scaling of the global receptive field while maximally preserving the original point cloud structure. This design enables effective adaptation to objects of varying sizes and categories, providing a viable solution for detection applications across diverse scenarios. Extensive experiments on KITTI and the Waymo Open Dataset (WOD) demonstrate the strong competitiveness of our method, with particularly significant improvements in small object detection. Notably, our approach achieves the second-highest mAP of 65.44% across three categories (car, pedestrian, and cyclist) on the KITTI benchmark. The source code is available at https://github.com/yujianxinnian/VoxT-GNN.
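The abstract does not say how GnnFFN builds its graph or aggregates features, so the following is only a minimal sketch of the general pattern it describes: points are grouped into voxels, and a graph step then mixes features across voxels with an adjustable neighbourhood size. Everything here is an assumption made for illustration, not the authors' implementation: the function names `voxelize`, `knn_graph`, and `message_pass`, the k-nearest-neighbour graph, the mean aggregation, and the use of voxel centroids as a stand-in for encoder output. For the actual method, see the repository linked above.

```python
from collections import defaultdict
from math import dist, floor


def voxelize(points, voxel_size):
    """Group 3D points by integer voxel index; return {index: centroid}."""
    buckets = defaultdict(list)
    for p in points:
        key = tuple(floor(c / voxel_size) for c in p)
        buckets[key].append(p)
    return {
        key: tuple(sum(axis) / len(pts) for axis in zip(*pts))
        for key, pts in buckets.items()
    }


def knn_graph(centroids, k):
    """For each voxel, list its k nearest other voxels by centroid distance.

    Hypothetical graph construction: a larger k widens the receptive field.
    """
    keys = list(centroids)
    graph = {}
    for a in keys:
        others = sorted(
            (b for b in keys if b != a),
            key=lambda b: dist(centroids[a], centroids[b]),
        )
        graph[a] = others[:k]
    return graph


def message_pass(features, graph):
    """One round of mean aggregation over each voxel and its neighbours."""
    updated = {}
    for node, neighbours in graph.items():
        group = [features[node]] + [features[n] for n in neighbours]
        updated[node] = tuple(sum(axis) / len(group) for axis in zip(*group))
    return updated


# Toy point cloud: two points share a voxel, two sit in voxels of their own.
points = [(0.1, 0.1, 0.0), (0.2, 0.3, 0.1), (1.2, 0.1, 0.0), (5.1, 5.2, 0.3)]
centroids = voxelize(points, voxel_size=1.0)
graph = knn_graph(centroids, k=2)

# Stand-in for encoder output: use each voxel's centroid as its feature.
features = message_pass(dict(centroids), graph)
```

In this toy setting, raising `k` lets each voxel aggregate from more distant voxels, which is one simple way to picture a scalable global receptive field. A learned model would replace the plain mean with trainable weights.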

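Source journal

Information Processing & Management (Engineering & Technology: Computer Science, Information Systems)

CiteScore: 17.00
Self-citation rate: 11.60%
Articles published: 276
Review time: 39 days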

About the journal: Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing. We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.