Accelerating Large-Scale Graph Analytics with FPGA and HMC

2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). Pub Date: 2017-04-01. DOI: 10.1109/FCCM.2017.58

Soroosh Khoram, Jialiang Zhang, Maxwell Strange, J. Li

Citations: 1

Abstract

Graph analytics, which explores the relationships among interconnected entities, is becoming increasingly important because of its broad applicability, from machine learning to social science. However, one major challenge for graph processing systems is the irregular data access pattern of graph computation, which can significantly degrade performance. As a result, the algorithms, software, and hardware that have been tailored for mainstream parallel applications are generally not effective for massive-scale sparse graphs from the real world, owing to the complexity and irregularity of those graphs. To address the performance issues in large-scale graph analytics, we combine the emerging Hybrid Memory Cube (HMC) with a modern FPGA to achieve exceptional random access performance without any loss of flexibility or efficiency in computation. In particular, we develop collaborative software/hardware techniques to perform a level-synchronized breadth-first search (BFS) on the FPGA-HMC platform. From the software perspective, we develop an architecture-aware graph clustering algorithm that fully exploits the platform's capability to improve data locality and memory access efficiency. For each input graph, this algorithm provides an efficient data layout that allows the FPGA to coalesce memory requests into the largest possible HMC payload requests, so that the number of memory requests, which is the primary factor in runtime, can be minimized. From the hardware perspective, we further improve the FPGA-HMC graph processor architecture by adding a merging unit. The merging unit takes full advantage of the increased data locality resulting from graph clustering. We evaluated the performance of our BFS implementation using the AC-510 development kit from Micron over a set of benchmarks from a wide range of applications. We observed that the combination of the clustering algorithm and the merging hardware achieved a 2.8× average performance improvement compared to the latest FPGA-HMC-based graph processing system.
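
To make the link between data layout and request count concrete, the following is a minimal software sketch, not the paper's clustering algorithm or its merging hardware. It runs a level-synchronized BFS over a CSR graph and counts one memory request per distinct aligned block of per-vertex records touched in each level. The names `build_csr` and `bfs_with_request_count`, the 128-byte payload, the 4-byte vertex record, and the hand-picked vertex labellings are all assumptions made for illustration.

```python
# Toy model: level-synchronized BFS that counts coalesced memory requests.
# All sizes below are illustrative assumptions, not values from the paper.
PAYLOAD_BYTES = 128    # assumed largest request payload
BYTES_PER_VERTEX = 4   # assumed size of one per-vertex record


def build_csr(num_vertices, edges):
    """Build an undirected graph in CSR form from (u, v) pairs."""
    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    row_ptr, col_idx = [0], []
    for nbrs in adj:
        col_idx.extend(sorted(nbrs))
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx


def bfs_with_request_count(row_ptr, col_idx, source):
    """Return (levels, requests); level is -1 for unreachable vertices."""
    n = len(row_ptr) - 1
    per_block = PAYLOAD_BYTES // BYTES_PER_VERTEX
    level = [-1] * n
    level[source] = 0
    frontier, depth, requests = [source], 0, 0
    while frontier:
        # Gather every neighbor whose record is read in this level.
        touched = set()
        for u in frontier:
            touched.update(col_idx[row_ptr[u]:row_ptr[u + 1]])
        # Reads that fall in the same aligned block share one request.
        requests += len({v // per_block for v in touched})
        # Level synchronization: the next frontier is built only after
        # the whole current frontier has been expanded.
        next_frontier = []
        for v in sorted(touched):
            if level[v] == -1:
                level[v] = depth + 1
                next_frontier.append(v)
        frontier = next_frontier
        depth += 1
    return level, requests


# Same star graph under two hypothetical labellings of its four leaves.
scattered = build_csr(161, [(0, 40), (0, 80), (0, 120), (0, 160)])
clustered = build_csr(5, [(0, 1), (0, 2), (0, 3), (0, 4)])

scattered_levels, scattered_requests = bfs_with_request_count(*scattered, 0)
clustered_levels, clustered_requests = bfs_with_request_count(*clustered, 0)
print(scattered_requests, clustered_requests)
```

Under these assumed sizes, the scattered labelling needs 5 requests and the clustered labelling needs 2 for the same traversal, which is the effect a locality-improving layout is meant to produce. A real layout would come from a clustering algorithm rather than hand-chosen labels.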
