An FPGA framework for edge-centric graph processing

Shijie Zhou, R. Kannan, Hanqing Zeng, V. Prasanna
{"title":"An FPGA framework for edge-centric graph processing","authors":"Shijie Zhou, R. Kannan, Hanqing Zeng, V. Prasanna","doi":"10.1145/3203217.3203233","DOIUrl":null,"url":null,"abstract":"Many emerging real-world applications require fast processing of large-scale data represented in the form of graphs. In this paper, we design a Field-Programmable Gate Array (FPGA) framework to accelerate graph algorithms based on the edge-centric paradigm. Our design is flexible for accelerating general graph algorithms with various vertex attributes and update propagation functions, such as Sparse Matrix Vector Multiplication (SpMV), PageRank (PR), Single Source Shortest Path (SSSP), and Weakly Connected Component (WCC). The target platform consists of large external memory to store the graph data and FPGA to accelerate the processing. By taking an edge-centric graph algorithm and hardware resource constraints as inputs, our framework can determine the optimal design parameters and produce an optimized Register-Transfer Level (RTL) FPGA accelerator design. To improve data locality and increase parallelism, we partition the input graph into non-overlapping partitions. This enables our framework to efficiently buffer vertex data in the on-chip memory of FPGA and exploit both inter-partition and intra-partition parallelism. Further, we propose an optimized data layout to improve external memory performance and reduce data communication between FPGA and external memory. Based on our design methodology, we accelerate two fundamental graph algorithms for performance evaluation: Sparse Matrix Vector Multiplication (SpMV) and PageRank (PR). Experimental results show that our accelerators sustain a high throughput of up to 2250 Million Traversed Edges Per Second (MTEPS) and 2487 MTEPS for SpMV and PR, respectively. Compared with several highly-optimized multi-core designs, our FPGA framework achieves up to 20.5× speedup for SpMV, and 17.7× speedup for PR, respectively; compared with two state-of-the-art FPGA frameworks, our designs demonstrate up to 5.3× and 1.8× throughput improvement for SpMV and PR, respectively.","PeriodicalId":127096,"journal":{"name":"Proceedings of the 15th ACM International Conference on Computing Frontiers","volume":"37 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"41","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 15th ACM International Conference on Computing Frontiers","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3203217.3203233","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 41

Abstract

Many emerging real-world applications require fast processing of large-scale data represented in the form of graphs. In this paper, we design a Field-Programmable Gate Array (FPGA) framework to accelerate graph algorithms based on the edge-centric paradigm. Our design is flexible for accelerating general graph algorithms with various vertex attributes and update propagation functions, such as Sparse Matrix Vector Multiplication (SpMV), PageRank (PR), Single Source Shortest Path (SSSP), and Weakly Connected Component (WCC). The target platform consists of large external memory to store the graph data and FPGA to accelerate the processing. By taking an edge-centric graph algorithm and hardware resource constraints as inputs, our framework can determine the optimal design parameters and produce an optimized Register-Transfer Level (RTL) FPGA accelerator design. To improve data locality and increase parallelism, we partition the input graph into non-overlapping partitions. This enables our framework to efficiently buffer vertex data in the on-chip memory of FPGA and exploit both inter-partition and intra-partition parallelism. Further, we propose an optimized data layout to improve external memory performance and reduce data communication between FPGA and external memory. Based on our design methodology, we accelerate two fundamental graph algorithms for performance evaluation: Sparse Matrix Vector Multiplication (SpMV) and PageRank (PR). Experimental results show that our accelerators sustain a high throughput of up to 2250 Million Traversed Edges Per Second (MTEPS) and 2487 MTEPS for SpMV and PR, respectively. Compared with several highly-optimized multi-core designs, our FPGA framework achieves up to 20.5× speedup for SpMV, and 17.7× speedup for PR, respectively; compared with two state-of-the-art FPGA frameworks, our designs demonstrate up to 5.3× and 1.8× throughput improvement for SpMV and PR, respectively.
边缘中心图形处理的FPGA框架
许多新兴的实际应用程序需要快速处理以图形形式表示的大规模数据。在本文中,我们设计了一个现场可编程门阵列(FPGA)框架来加速基于边缘中心范式的图算法。我们的设计可以灵活地加速具有各种顶点属性和更新传播函数的一般图算法,例如稀疏矩阵向量乘法(SpMV), PageRank (PR),单源最短路径(SSSP)和弱连接组件(WCC)。目标平台由存储图形数据的大型外部存储器和加速处理的FPGA组成。通过采用以边为中心的图形算法和硬件资源约束作为输入,我们的框架可以确定最佳设计参数,并产生优化的寄存器-传输级(RTL) FPGA加速器设计。为了提高数据的局部性和增加并行性,我们将输入图划分为不重叠的分区。这使我们的框架能够有效地缓冲FPGA片上存储器中的顶点数据,并利用分区间和分区内的并行性。此外,我们提出了一种优化的数据布局,以提高外部存储器的性能,减少FPGA与外部存储器之间的数据通信。基于我们的设计方法,我们加速了两种用于性能评估的基本图算法:稀疏矩阵向量乘法(SpMV)和PageRank (PR)。实验结果表明,我们的加速器在SpMV和PR上的吞吐量分别高达每秒22.5亿遍历边(MTEPS)和2487 MTEPS。与几种高度优化的多核设计相比,我们的FPGA框架在SpMV和PR上分别实现了20.5倍和17.7倍的加速提升;与两种最先进的FPGA框架相比,我们的设计显示SpMV和PR的吞吐量分别提高了5.3倍和1.8倍。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信