Dynamic clustering transformer for LiDAR-based 3D object detection

IF 7.6 · Region 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Pattern Recognition · Pub Date: 2025-09-18 · DOI: 10.1016/j.patcog.2025.112444

Yubo Cui, Zhiheng Li, Zheng Fang

Pattern Recognition, vol. 172, Article 112444. Journal Article. https://www.sciencedirect.com/science/article/pii/S0031320325011069

Citations: 0

Abstract

LiDAR perception is a critical task in 3D computer vision. Inspired by the success of vision transformers on 2D images, many LiDAR-based detectors partition the whole-scene point cloud into non-overlapping windows, then apply window attention to capture local information and window shifting to capture global information. While these methods have improved LiDAR detection performance, they often fail to account for the intrinsic separability of 3D LiDAR point clouds. Unlike 2D images, where objects can overlap and blend into one another, objects in LiDAR point clouds are distinct and non-overlapping. Building on this insight, we propose the Dynamic Cluster Transformer (DCT), a clustering-based point cloud backbone that incorporates the transformer architecture. Our approach is designed to exploit the unique characteristics of LiDAR point clouds, enabling more efficient 3D feature extraction. Specifically, the DCT architecture comprises two primary modules: Sparse Cluster Generation (SCG) and Cluster Feature Interaction (CFI). SCG produces initial sparse cluster features from the whole-scene point cloud, providing a basis for local and global feature propagation. CFI then propagates information between these clusters and their surrounding voxels, allowing a more comprehensive understanding of spatial relationships. The proposed clustering-based learning process is simple yet effective and conforms to the physical characteristics of LiDAR point clouds. Empirical results demonstrate that DCT achieves state-of-the-art performance on the large-scale Waymo Open Dataset and the nuScenes dataset.
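The two-stage idea in the abstract (generate sparse clusters, then let voxels exchange information with them) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how clusters are formed or how the interaction is computed. Seeding by farthest-point sampling, nearest-seed assignment, mean pooling, and unprojected scaled dot-product attention are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

def farthest_point_sampling(coords, k):
    """Pick k well-spread voxel indices (assumed stand-in for cluster seeding)."""
    chosen = [0]
    dist = [math.dist(c, coords[0]) for c in coords]
    while len(chosen) < k:
        nxt = max(range(len(coords)), key=lambda i: dist[i])
        chosen.append(nxt)
        dist = [min(d, math.dist(c, coords[nxt])) for d, c in zip(dist, coords)]
    return chosen

def generate_clusters(coords, feats, k):
    """Hypothetical cluster generation: nearest-seed assignment plus mean pooling."""
    seeds = farthest_point_sampling(coords, k)
    assign = [min(range(k), key=lambda j: math.dist(c, coords[seeds[j]]))
              for c in coords]
    dim = len(feats[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for a, f in zip(assign, feats):
        counts[a] += 1
        for d in range(dim):
            sums[a][d] += f[d]
    # Guard against an empty cluster (possible only with duplicate seeds).
    cluster_feats = [[s / max(counts[j], 1) for s in sums[j]] for j in range(k)]
    return assign, cluster_feats

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def voxel_to_cluster_attention(feats, cluster_feats):
    """Each voxel attends over all cluster features (no learned projections)."""
    dim = len(feats[0])
    scale = math.sqrt(dim)
    out = []
    for q in feats:
        scores = [sum(a * b for a, b in zip(q, c)) / scale for c in cluster_feats]
        w = softmax(scores)
        mixed = [sum(w[j] * cluster_feats[j][d] for j in range(len(cluster_feats)))
                 for d in range(dim)]
        out.append([x + y for x, y in zip(q, mixed)])  # residual connection
    return out

# Toy scene: two well-separated groups of voxels with different features.
coords = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (20, 20, 0), (21, 20, 0), (20, 21, 0)]
feats = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3

assign, cluster_feats = generate_clusters(coords, feats, k=2)
updated = voxel_to_cluster_attention(feats, cluster_feats)
print(assign, cluster_feats)
```

In this toy scene the two groups are spatially separated, so each group falls into its own cluster. That mirrors the separability argument in the abstract, although a real backbone would use learned projections and sparse voxel data structures.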




Source journal

Pattern Recognition (Engineering Technology - Engineering: Electrical & Electronic)

CiteScore: 14.40

Self-citation rate: 16.20%

Publications: 683

Review time: 5.6 months

Journal description: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas such as biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.