CenterFormer:用于3D对象检测的基于中心的变压器

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision Pub Date : 2022-09-12 DOI:10.48550/arXiv.2209.05588

Zixiang Zhou, Xian Zhao, Yu Wang, Panqu Wang, H. Foroosh

{"title":"CenterFormer:用于3D对象检测的基于中心的变压器","authors":"Zixiang Zhou, Xian Zhao, Yu Wang, Panqu Wang, H. Foroosh","doi":"10.48550/arXiv.2209.05588","DOIUrl":null,"url":null,"abstract":"Query-based transformer has shown great potential in constructing long-range attention in many image-domain tasks, but has rarely been considered in LiDAR-based 3D object detection due to the overwhelming size of the point cloud data. In this paper, we propose CenterFormer, a center-based transformer network for 3D object detection. CenterFormer first uses a center heatmap to select center candidates on top of a standard voxel-based point cloud encoder. It then uses the feature of the center candidate as the query embedding in the transformer. To further aggregate features from multiple frames, we design an approach to fuse features through cross-attention. Lastly, regression heads are added to predict the bounding box on the output center feature representation. Our design reduces the convergence difficulty and computational complexity of the transformer structure. The results show significant improvements over the strong baseline of anchor-free object detection networks. CenterFormer achieves state-of-the-art performance for a single model on the Waymo Open Dataset, with 73.7% mAPH on the validation set and 75.6% mAPH on the test set, significantly outperforming all previously published CNN and transformer-based methods. Our code is publicly available at https://github.com/TuSimple/centerformer","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"139 1","pages":"496-513"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"51","resultStr":"{\"title\":\"CenterFormer: Center-based Transformer for 3D Object Detection\",\"authors\":\"Zixiang Zhou, Xian Zhao, Yu Wang, Panqu Wang, H. Foroosh\",\"doi\":\"10.48550/arXiv.2209.05588\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Query-based transformer has shown great potential in constructing long-range attention in many image-domain tasks, but has rarely been considered in LiDAR-based 3D object detection due to the overwhelming size of the point cloud data. In this paper, we propose CenterFormer, a center-based transformer network for 3D object detection. CenterFormer first uses a center heatmap to select center candidates on top of a standard voxel-based point cloud encoder. It then uses the feature of the center candidate as the query embedding in the transformer. To further aggregate features from multiple frames, we design an approach to fuse features through cross-attention. Lastly, regression heads are added to predict the bounding box on the output center feature representation. Our design reduces the convergence difficulty and computational complexity of the transformer structure. The results show significant improvements over the strong baseline of anchor-free object detection networks. CenterFormer achieves state-of-the-art performance for a single model on the Waymo Open Dataset, with 73.7% mAPH on the validation set and 75.6% mAPH on the test set, significantly outperforming all previously published CNN and transformer-based methods. Our code is publicly available at https://github.com/TuSimple/centerformer\",\"PeriodicalId\":72676,\"journal\":{\"name\":\"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision\",\"volume\":\"139 1\",\"pages\":\"496-513\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"51\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2209.05588\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2209.05588","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 51

摘要

在许多图像域任务中，基于查询的变换在构建远程注意力方面显示出巨大的潜力，但由于点云数据的压倒性规模，在基于lidar的三维目标检测中很少被考虑。在本文中，我们提出了CenterFormer，一个基于中心的变压器网络，用于三维目标检测。CenterFormer首先使用中心热图在标准的基于体素的点云编码器上选择中心候选者。然后使用中心候选的特征作为查询嵌入到转换器中。为了进一步聚合多帧的特征，我们设计了一种通过交叉关注来融合特征的方法。最后，加入回归头来预测输出中心特征表示上的边界框。我们的设计降低了变压器结构的收敛难度和计算复杂度。结果表明，与无锚点目标检测网络的强基线相比，该方法有显著的改进。CenterFormer在Waymo开放数据集上对单个模型实现了最先进的性能，在验证集上的mAPH为73.7%，在测试集上的mAPH为75.6%，显著优于之前发布的所有基于CNN和变压器的方法。我们的代码可以在https://github.com/TuSimple/centerformer上公开获得

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

CenterFormer: Center-based Transformer for 3D Object Detection

Query-based transformer has shown great potential in constructing long-range attention in many image-domain tasks, but has rarely been considered in LiDAR-based 3D object detection due to the overwhelming size of the point cloud data. In this paper, we propose CenterFormer, a center-based transformer network for 3D object detection. CenterFormer first uses a center heatmap to select center candidates on top of a standard voxel-based point cloud encoder. It then uses the feature of the center candidate as the query embedding in the transformer. To further aggregate features from multiple frames, we design an approach to fuse features through cross-attention. Lastly, regression heads are added to predict the bounding box on the output center feature representation. Our design reduces the convergence difficulty and computational complexity of the transformer structure. The results show significant improvements over the strong baseline of anchor-free object detection networks. CenterFormer achieves state-of-the-art performance for a single model on the Waymo Open Dataset, with 73.7% mAPH on the validation set and 75.6% mAPH on the test set, significantly outperforming all previously published CNN and transformer-based methods. Our code is publicly available at https://github.com/TuSimple/centerformer

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

自引率

0.00%

发文量