IP Cores for Graph Kernels on FPGAs

2019 IEEE High Performance Extreme Computing Conference (HPEC). Pub Date: 2019-09-01. DOI: 10.1109/HPEC.2019.8916363

S. Kuppannagari, Rachit Rajat, R. Kannan, A. Dasu, V. Prasanna


Citations: 2

Abstract


Graphs are a powerful abstraction for representing networked data in many real-world applications. The need for large-scale graph analytics has led to widespread adoption of dedicated hardware accelerators, such as FPGAs, for this purpose. In this work, we develop IP cores for several key graph kernels. Our IP cores use the graph processing over partitions (GPOP) programming paradigm to perform computations over graph partitions. Partitioning the input graph into non-overlapping partitions improves on-chip data reuse. Additional optimizations that exploit intra- and inter-partition parallelism and reduce external memory accesses are also discussed. We generate FPGA designs for general graph algorithms with various vertex attributes and update propagation functions, such as Sparse Matrix Vector Multiplication (SpMV), PageRank (PR), Single Source Shortest Path (SSSP), and Weakly Connected Components (WCC). We target a platform consisting of a large external DDR4 memory to store the graph data and an Intel Stratix FPGA to accelerate the processing. Experimental results show that our accelerators sustain a throughput of up to 2250, 2300, 3378, and 2178 Million Traversed Edges Per Second (MTEPS) for SpMV, PR, SSSP, and WCC, respectively. Compared with several highly optimized multi-core designs, our FPGA framework achieves up to 20.5× speedup for SpMV, 16.4× for PR, 3.5× for SSSP, and 35.1× for WCC. Compared with two state-of-the-art FPGA frameworks, our designs demonstrate up to 5.3× speedup for SpMV, 1.64× for PR, and 1.8× for WCC. We develop a performance model for our GPOP paradigm and use it to predict the performance of our designs assuming the graph is stored in HBM2 instead of DRAM. We further discuss extensions to our optimizations to improve the throughput.
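The abstract does not spell out how computation over partitions proceeds, so the following is only an illustrative software sketch, not the authors' FPGA design. It assumes a two-phase scatter/gather scheme over contiguous, non-overlapping vertex ranges and applies it to SpMV. The function names (`spmv_partitioned`, `mteps`), the edge-list layout, and the `part_size` parameter are hypothetical and chosen for illustration. The point it makes is the one the abstract states: each phase touches the vertex data of one partition at a time, which is what would allow that slice to stay in on-chip memory.

```python
def spmv_partitioned(edges, x, num_vertices, part_size):
    """Compute y[dst] += w * x[src] over all edges, one partition at a time.

    Assumption (not from the paper): vertices are split into contiguous,
    non-overlapping ranges of `part_size` vertices each.
    """
    num_parts = (num_vertices + part_size - 1) // part_size

    # Group edges by the partition that owns their source vertex.
    by_src = [[] for _ in range(num_parts)]
    for src, dst, w in edges:
        by_src[src // part_size].append((src, dst, w))

    # Scatter phase: a source partition reads only its own slice of x and
    # writes each update into the bin of the destination partition.
    bins = [[] for _ in range(num_parts)]
    for p in range(num_parts):
        for src, dst, w in by_src[p]:
            bins[dst // part_size].append((dst, w * x[src]))

    # Gather phase: a destination partition drains its bin and updates only
    # its own slice of y.
    y = [0.0] * num_vertices
    for q in range(num_parts):
        for dst, value in bins[q]:
            y[dst] += value
    return y


def mteps(edges_traversed, seconds):
    """Throughput in Million Traversed Edges Per Second."""
    return edges_traversed / seconds / 1e6


if __name__ == "__main__":
    # Tiny 4-vertex example graph: (source, destination, weight).
    edges = [(0, 1, 2.0), (1, 2, 3.0), (3, 0, 1.0), (2, 3, 4.0), (0, 3, 0.5)]
    x = [1.0, 2.0, 3.0, 4.0]
    print(spmv_partitioned(edges, x, num_vertices=4, part_size=2))
    print(mteps(2_250_000_000, 1.0))
```

Under this definition of MTEPS, the reported SpMV peak of 2250 MTEPS corresponds to 2.25 billion edges traversed per second. Swapping the multiply-and-add in the two phases for a different update and reduction (for example, add-and-min for SSSP) gives the other kernels under the same assumed structure.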
