Chenhao Liu, Zhiyuan Shao, Kexin Li, Minkang Wu, Jiajie Chen, Ruoshi Li, Xiaofei Liao, Hai Jin
{"title":"ScalaBFS: A Scalable BFS Accelerator on FPGA-HBM Platform","authors":"Chenhao Liu, Zhiyuan Shao, Kexin Li, Minkang Wu, Jiajie Chen, Ruoshi Li, Xiaofei Liao, Hai Jin","doi":"10.1145/3431920.3439463","DOIUrl":null,"url":null,"abstract":"High Bandwidth Memory (HBM) provides massive aggregated memory bandwidth by exposing multiple memory channels to the processing units. To achieve high performance, an accelerator built on top of an FPGA configured with HBM (i.e., FPGA-HBM platform) needs to scale its performance according to the available memory channels. In this paper, we propose an accelerator for BFS (Breadth-First Search), named as ScalaBFS, which decouples memory accessing from processing to scale its performance with available HBM memory channels. Moreover, by configuring each HBM memory channel with multiple processing elements, ScalaBFS sufficiently exploits the memory bandwidth of HBM. We implement the prototype system of ScalaBFS and conduct BFS in both real-world and synthetic scale-free graphs on Xilinx Alveo U280 Data Center Accelerator card (real hardware). The experimental results show that ScalaBFS scales its performance almost linearly according to the available memory pseudo channels (PCs) from the HBM2 subsystem of U280. By fully using the 32 PCs and building 64 processing elements (PEs) on U280, ScalaBFS achieves a performance up to 19.7 GTEPS (Giga Traversed Edges Per Second). When conducting BFS in sparse real-world graphs, ScalaBFS achieves equivalent GTEPS to Gunrock running on the state-of-art Nvidia V100 GPU that features 64-PC HBM2 (twice memory bandwidth than U280).","PeriodicalId":386071,"journal":{"name":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3431920.3439463","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 9
Abstract
High Bandwidth Memory (HBM) provides massive aggregated memory bandwidth by exposing multiple memory channels to the processing units. To achieve high performance, an accelerator built on top of an FPGA configured with HBM (i.e., FPGA-HBM platform) needs to scale its performance according to the available memory channels. In this paper, we propose an accelerator for BFS (Breadth-First Search), named as ScalaBFS, which decouples memory accessing from processing to scale its performance with available HBM memory channels. Moreover, by configuring each HBM memory channel with multiple processing elements, ScalaBFS sufficiently exploits the memory bandwidth of HBM. We implement the prototype system of ScalaBFS and conduct BFS in both real-world and synthetic scale-free graphs on Xilinx Alveo U280 Data Center Accelerator card (real hardware). The experimental results show that ScalaBFS scales its performance almost linearly according to the available memory pseudo channels (PCs) from the HBM2 subsystem of U280. By fully using the 32 PCs and building 64 processing elements (PEs) on U280, ScalaBFS achieves a performance up to 19.7 GTEPS (Giga Traversed Edges Per Second). When conducting BFS in sparse real-world graphs, ScalaBFS achieves equivalent GTEPS to Gunrock running on the state-of-art Nvidia V100 GPU that features 64-PC HBM2 (twice memory bandwidth than U280).