ScalaBFS: A Scalable BFS Accelerator on FPGA-HBM Platform

The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Pub Date: 2021-02-17. DOI: 10.1145/3431920.3439463

Chenhao Liu, Zhiyuan Shao, Kexin Li, Minkang Wu, Jiajie Chen, Ruoshi Li, Xiaofei Liao, Hai Jin


Citations: 9

Abstract

High Bandwidth Memory (HBM) provides massive aggregate memory bandwidth by exposing multiple memory channels to the processing units. To achieve high performance, an accelerator built on an FPGA equipped with HBM (i.e., an FPGA-HBM platform) needs to scale its performance with the number of available memory channels. In this paper, we propose a BFS (Breadth-First Search) accelerator, named ScalaBFS, which decouples memory access from processing so that its performance scales with the available HBM memory channels. Moreover, by attaching multiple processing elements to each HBM memory channel, ScalaBFS effectively exploits the memory bandwidth of HBM. We implement a prototype of ScalaBFS and run BFS on both real-world and synthetic scale-free graphs on a Xilinx Alveo U280 Data Center Accelerator card (real hardware). The experimental results show that the performance of ScalaBFS scales almost linearly with the number of memory pseudo channels (PCs) available from the HBM2 subsystem of the U280. Using all 32 PCs and building 64 processing elements (PEs) on the U280, ScalaBFS achieves a performance of up to 19.7 GTEPS (Giga Traversed Edges Per Second). When running BFS on sparse real-world graphs, ScalaBFS achieves GTEPS equivalent to that of Gunrock running on the state-of-the-art Nvidia V100 GPU, which features 64-PC HBM2 (twice the memory bandwidth of the U280).
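To make the GTEPS metric and the idea of spreading work over several processing elements concrete, the following is a minimal software sketch, not the ScalaBFS hardware design. It runs a level-synchronous BFS in which frontier vertices are assigned to hypothetical PEs by `vertex_id % num_pes` (an assumed partitioning scheme, chosen only for illustration), counts every edge examined, and converts that count into GTEPS. The names `bfs_levels`, `gteps`, and `num_pes` are illustrative and do not come from the paper, and benchmark suites may count traversed edges differently.

```python
from collections import defaultdict


def bfs_levels(adj, root, num_pes=4):
    """Level-synchronous BFS over an adjacency dict {vertex: [neighbors]}.

    Frontier vertices are grouped by a hypothetical PE id (v % num_pes);
    the groups are processed one after another here, whereas hardware PEs
    would work in parallel. Returns (level per vertex, edges examined).
    """
    level = {root: 0}
    frontier = [root]
    traversed_edges = 0
    depth = 0
    while frontier:
        # Assumed hash partitioning of the current frontier across PEs.
        per_pe = defaultdict(list)
        for v in frontier:
            per_pe[v % num_pes].append(v)

        depth += 1
        next_frontier = []
        for pe in range(num_pes):
            for v in per_pe[pe]:
                for w in adj.get(v, ()):
                    traversed_edges += 1      # every examined edge counts
                    if w not in level:        # first visit fixes the level
                        level[w] = depth
                        next_frontier.append(w)
        frontier = next_frontier
    return level, traversed_edges


def gteps(traversed_edges, seconds):
    """Giga Traversed Edges Per Second for one BFS run."""
    return traversed_edges / seconds / 1e9


if __name__ == "__main__":
    # A 4-cycle stored as an undirected graph (each edge listed twice).
    graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
    levels, edges = bfs_levels(graph, root=0, num_pes=2)
    print(levels, edges)
```

Under this definition, a run that examines 19.7 billion edges in one second corresponds to the 19.7 GTEPS peak quoted in the abstract.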
