QuickNN: Memory and Performance Optimization of k-d Tree Based Nearest Neighbor Search for 3D Point Clouds

2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2020-02-01 DOI:10.1109/HPCA47549.2020.00024

Reid Pinkham, Shuqing Zeng, Zhengya Zhang

{"title":"QuickNN: Memory and Performance Optimization of k-d Tree Based Nearest Neighbor Search for 3D Point Clouds","authors":"Reid Pinkham, Shuqing Zeng, Zhengya Zhang","doi":"10.1109/HPCA47549.2020.00024","DOIUrl":null,"url":null,"abstract":"The use of Light Detection And Ranging (LiDAR) has enabled the continued improvement in accuracy and performance of autonomous navigation. The latest applications require LiDAR's of the highest spatial resolution, which generate a massive amount of 3D point clouds that need to be processed in real time. In this work, we investigate the architecture design for k-Nearest Neighbor (kNN) search, an important processing kernel for 3D point clouds. An approximate kNN search based on a k-dimensional (k-d) tree is employed to improve performance. However, even for today's moderate-sized problems, this approximate kNN search is severely hindered by memory bandwidth due to numerous random accesses and minimal data reuse opportunities. We apply several memory optimization schemes to alleviate the bandwidth bottleneck: 1) the k-d tree data structure is partitioned to two sets: tree nodes and point buckets, based on their distinct characteristics – tree nodes that have high reuse are cached for their lifetime to facilitate search, while point buckets with low reuse are organized in regular contiguous segments in external memory to facilitate efficient burst access; 2) write and read caches are added to gather random accesses to transform them to sequential accesses; and 3) tree construction and tree search are interleaved to cut redundant access streams. With optimized memory bandwidth, the kNN search can be further accelerated by two new processing schemes: 1) parallel tree traversal that utilizes multiple workers with minimal tree duplication overhead, and 2) incremental tree building that minimizes the overhead of tree construction by dynamically updating the tree instead of building it from scratch every time. We demonstrate the performance and memory-optimized QuickNN architecture on FPGA and perform exhaustive benchmarking, showing that up to a 19× and 7.3× speedup over k-d tree searches performed on a modern CPU and GPU, respectively, and a 14.5× speedup over a comparable sized architecture performing an exact search. Finally, we show that QuickNN achieves two orders of magnitude performance per watt increase over CPU and GPU methods.","PeriodicalId":339648,"journal":{"name":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"23","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCA47549.2020.00024","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 23

Abstract

The use of Light Detection And Ranging (LiDAR) has enabled the continued improvement in accuracy and performance of autonomous navigation. The latest applications require LiDAR's of the highest spatial resolution, which generate a massive amount of 3D point clouds that need to be processed in real time. In this work, we investigate the architecture design for k-Nearest Neighbor (kNN) search, an important processing kernel for 3D point clouds. An approximate kNN search based on a k-dimensional (k-d) tree is employed to improve performance. However, even for today's moderate-sized problems, this approximate kNN search is severely hindered by memory bandwidth due to numerous random accesses and minimal data reuse opportunities. We apply several memory optimization schemes to alleviate the bandwidth bottleneck: 1) the k-d tree data structure is partitioned to two sets: tree nodes and point buckets, based on their distinct characteristics – tree nodes that have high reuse are cached for their lifetime to facilitate search, while point buckets with low reuse are organized in regular contiguous segments in external memory to facilitate efficient burst access; 2) write and read caches are added to gather random accesses to transform them to sequential accesses; and 3) tree construction and tree search are interleaved to cut redundant access streams. With optimized memory bandwidth, the kNN search can be further accelerated by two new processing schemes: 1) parallel tree traversal that utilizes multiple workers with minimal tree duplication overhead, and 2) incremental tree building that minimizes the overhead of tree construction by dynamically updating the tree instead of building it from scratch every time. We demonstrate the performance and memory-optimized QuickNN architecture on FPGA and perform exhaustive benchmarking, showing that up to a 19× and 7.3× speedup over k-d tree searches performed on a modern CPU and GPU, respectively, and a 14.5× speedup over a comparable sized architecture performing an exact search. Finally, we show that QuickNN achieves two orders of magnitude performance per watt increase over CPU and GPU methods.

查看原文本刊更多论文

基于k-d树的3D点云最近邻搜索的内存和性能优化

光探测和测距(LiDAR)的使用使自主导航的精度和性能不断提高。最新的应用需要最高空间分辨率的激光雷达，这会产生大量需要实时处理的3D点云。在这项工作中，我们研究了k-最近邻(kNN)搜索的架构设计，这是三维点云的一个重要处理内核。采用基于k维树的近似kNN搜索来提高性能。然而，即使对于今天中等规模的问题，由于大量的随机访问和最小的数据重用机会，这种近似kNN搜索也受到内存带宽的严重阻碍。我们采用了几种内存优化方案来缓解带宽瓶颈:1)基于k-d树数据结构的不同特征，将其划分为树节点和点桶两组:高重用的树节点在其生命周期内缓存以方便搜索，而低重用的点桶在外部存储器中组织成规则的连续段，以方便高效的突发访问;2)增加读写缓存，收集随机访问，将随机访问转换为顺序访问;3)树的构建和树的搜索是交错的，以减少冗余的访问流。通过优化内存带宽，可以通过两种新的处理方案进一步加速kNN搜索:1)并行树遍历，利用多个工作人员以最小的树复制开销，以及2)增量树构建，通过动态更新树而不是每次从头开始构建树来最小化树构建的开销。我们在FPGA上演示了性能和内存优化的QuickNN架构，并进行了详尽的基准测试，结果显示，在现代CPU和GPU上执行的k-d树搜索分别加速了19倍和7.3倍，在执行精确搜索的类似大小的架构上加速了14.5倍。最后，我们表明，与CPU和GPU方法相比，QuickNN每瓦的性能提高了两个数量级。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)

自引率

0.00%

发文量