Population Count on Intel® CPU, GPU and FPGA

2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). Pub Date: 2020-05-01. DOI: 10.1109/IPDPSW50202.2020.00081

Zheming Jin, H. Finkel


Citations: 1

Abstract

Population count is a primitive used in many applications, and commodity processors provide dedicated instructions for computing it at high performance. Motivated by the productivity of high-level synthesis and the importance of population count, in this paper we investigate OpenCL implementations of population count algorithms and evaluate their performance and resource utilization on an FPGA. Based on the results, we select the most efficient implementation. We then derive a reduction pattern from a representative application of population count. We parallelize the reduction with atomic functions and optimize it with vectorized memory accesses, tree reduction, and compute-unit duplication. We evaluate the performance of the reduction kernel on an Intel® Xeon® CPU, an Intel® Iris™ Pro integrated GPU, and an FPGA card that features an Intel® Arria® 10 FPGA. When DRAM memory bandwidth is comparable across the three computing platforms, the FPGA can achieve the highest kernel performance for large workloads. We also describe the performance bottlenecks on the FPGA. To make FPGAs more competitive in raw performance with high-performance CPU and GPU platforms, it is important to increase external memory bandwidth, minimize data movement between a host and a device, and reduce OpenCL runtime overhead on an FPGA.
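
To make the primitive concrete, the following is a minimal sketch in plain C of three widely known population count algorithms and of the sum-of-counts reduction that the abstract refers to. It is not the authors' OpenCL code: the abstract does not name the algorithms that were evaluated, and the function names (`popcount_naive`, `popcount_kernighan`, `popcount_swar`, `popcount_reduce`) are hypothetical, chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Naive method: inspect each of the 32 bits in turn. */
static unsigned popcount_naive(uint32_t x) {
    unsigned count = 0;
    for (int i = 0; i < 32; i++) {
        count += (x >> i) & 1u;
    }
    return count;
}

/* Kernighan's method: x & (x - 1) clears the lowest set bit,
 * so the loop runs once per set bit. */
static unsigned popcount_kernighan(uint32_t x) {
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1u;
        count++;
    }
    return count;
}

/* SWAR (bit-parallel) method: sum adjacent 1-bit, 2-bit, and 4-bit
 * fields in place, then add the four byte counts with one multiply. */
static unsigned popcount_swar(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (unsigned)((x * 0x01010101u) >> 24);
}

/* Sequential form of the reduction: total number of set bits in an
 * array. A parallel version (an assumption about the general pattern,
 * not the paper's kernel) would accumulate partial sums per work-item
 * and combine them with a tree reduction or an atomic add. */
static uint64_t popcount_reduce(const uint32_t *data, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += popcount_swar(data[i]);
    }
    return total;
}
```

All three functions return the same count for any 32-bit input; they differ in the number of operations and, on an FPGA, in the logic they would synthesize to, which is the kind of trade-off the paper evaluates.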
