Dynamic clustering transformer for LiDAR-based 3D object detection
Authors: Yubo Cui, Zhiheng Li, Zheng Fang
Journal: Pattern Recognition, vol. 172, Article 112444 (published 2025-09-18)
DOI: 10.1016/j.patcog.2025.112444
URL: https://www.sciencedirect.com/science/article/pii/S0031320325011069
Impact factor: 7.6 (JCR Q1, Computer Science, Artificial Intelligence)
Citations: 0
Abstract
LiDAR perception is a critical task in 3D computer vision. Inspired by the success of vision transformers on 2D images, many LiDAR-based detectors partition the scene point cloud into non-overlapping windows and apply window attention and window shifting to capture local and global information, respectively. While these methods have improved LiDAR detection performance, they often fail to account for the intrinsic separability of 3D LiDAR point clouds. Unlike 2D images, where objects can overlap and blend into one another, objects in LiDAR point clouds are distinct and non-overlapping. Building on this insight, we propose the Dynamic Cluster Transformer (DCT), a clustering-based point cloud backbone that incorporates the transformer architecture. Our approach is designed to exploit the unique characteristics of LiDAR point clouds, enabling more efficient 3D feature extraction. Specifically, DCT comprises two primary modules: Sparse Cluster Generation (SCG) and Cluster Feature Interaction (CFI). SCG produces initial sparse cluster features from the entire scene point cloud, providing a basis for local and global feature propagation. CFI then propagates information between these clusters and the surrounding voxels, allowing a more comprehensive understanding of spatial relationships. The proposed clustering-based learning process is simple yet effective and conforms to the physical characteristics of LiDAR point clouds. Empirical results demonstrate that DCT achieves state-of-the-art performance on the large-scale Waymo Open Dataset and the nuScenes dataset.
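The abstract gives no implementation details, so the following is only an illustrative sketch of the general idea of forming cluster features and letting voxels attend to them. It is not the authors' SCG or CFI. It assumes plain k-means on voxel centers as a stand-in for cluster generation and single-head scaled dot-product attention without learned projections as a stand-in for cluster-voxel interaction. The function names `generate_clusters` and `cluster_voxel_interaction` are hypothetical.

```python
import numpy as np


def generate_clusters(voxel_xyz, voxel_feat, num_clusters, num_iters=5, seed=0):
    """Hypothetical stand-in for cluster generation (assumption: plain k-means).

    Groups voxels by position and mean-pools their features per cluster.
    """
    rng = np.random.default_rng(seed)
    # Initialise cluster centers from randomly chosen voxel positions.
    init_idx = rng.choice(len(voxel_xyz), num_clusters, replace=False)
    centers = voxel_xyz[init_idx].astype(float)

    def nearest_center(points, ctrs):
        dist = np.linalg.norm(points[:, None, :] - ctrs[None, :, :], axis=-1)
        return dist.argmin(axis=1)

    for _ in range(num_iters):
        assign = nearest_center(voxel_xyz, centers)
        for k in range(num_clusters):
            mask = assign == k
            if mask.any():
                centers[k] = voxel_xyz[mask].mean(axis=0)

    # Final assignment against the updated centers.
    assign = nearest_center(voxel_xyz, centers)

    # Mean-pool voxel features inside each cluster (empty clusters stay zero).
    cluster_feat = np.zeros((num_clusters, voxel_feat.shape[1]))
    for k in range(num_clusters):
        mask = assign == k
        if mask.any():
            cluster_feat[k] = voxel_feat[mask].mean(axis=0)
    return assign, cluster_feat


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cluster_voxel_interaction(voxel_feat, cluster_feat):
    """Hypothetical stand-in for cluster-voxel interaction.

    Each voxel queries all cluster features with scaled dot-product attention
    (no learned projections) and adds the result as a residual.
    """
    dim = voxel_feat.shape[1]
    attn = softmax(voxel_feat @ cluster_feat.T / np.sqrt(dim), axis=-1)
    return voxel_feat + attn @ cluster_feat, attn


# Usage on random data: 50 voxels, 8-dimensional features, 4 clusters.
rng = np.random.default_rng(1)
xyz = rng.normal(size=(50, 3))
feat = rng.normal(size=(50, 8))
assign, cluster_feat = generate_clusters(xyz, feat, num_clusters=4)
updated_feat, attn = cluster_voxel_interaction(feat, cluster_feat)
```

Because every voxel attends to every cluster, the cost of this toy interaction grows with the number of voxels times the number of clusters, rather than with the square of the number of voxels. Whether the paper's modules behave this way is not stated in the abstract.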
About the journal:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.