OBFS: OpenCL Based BFS Optimizations on Software Programmable FPGAs
Cheng Liu, Xinyu Chen, Bingsheng He, Xiaofei Liao, Ying Wang, Lei Zhang
2019 International Conference on Field-Programmable Technology (ICFPT), 2019-12-01
DOI: 10.1109/ICFPT47387.2019.00056
Citations: 6
Abstract
Breadth-first search (BFS) is a key building block of graph processing, and considerable effort has gone into accelerating it on FPGAs for both performance and energy efficiency. Prior work typically built BFS accelerators as handcrafted circuits in a hardware description language (HDL). Although such designs perform relatively well, they suffer from extremely low design productivity and high porting and maintenance costs. High-level synthesis (HLS) tools make it convenient to create a functionally correct BFS accelerator, but its performance can be much lower than that of a handcrafted HDL design. To obtain near-handcrafted performance together with software-like benefits such as portability and maintainability, we propose OBFS, an OpenCL-based BFS accelerator for software-programmable FPGAs. Observing that OpenCL-based FPGA designs handle irregular memory accesses inefficiently, we apply data alignment, graph reordering, and batching to ensure coalesced memory accesses. In addition, we take advantage of the on-chip buffer to mitigate inefficient random DDR accesses. Finally, we move the random level update out of the main BFS processing pipeline and overlap it with the subsequent BFS processing task. In our experiments on Intel Harp-v2, OBFS achieves average speedups of 9.5X and 5.5X over a vertex-centric implementation and an edge-centric implementation, respectively. Compared with prior handcrafted designs, it achieves comparable or even better performance.
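The idea of moving the random level update out of the main traversal can be illustrated with a small software sketch. This is not the authors' OpenCL kernel: the function name `bfs_deferred_levels`, the CSR arrays `row_ptr` and `col_idx`, and the `visited` list (standing in for a compact on-chip bitmap) are all assumptions made for illustration. The sketch scans each frontier vertex's contiguous edge slice first, then applies the scattered `level` writes in a separate step. Here that step simply runs afterwards; in hardware it could overlap with the next traversal.

```python
from typing import List


def bfs_deferred_levels(row_ptr: List[int], col_idx: List[int], root: int) -> List[int]:
    """Level-synchronous BFS on a CSR graph with level writes deferred per frontier.

    Hypothetical illustration only; names and structure are not from the paper.
    """
    n = len(row_ptr) - 1
    level = [-1] * n        # -1 marks a vertex that was never reached
    visited = [False] * n   # assumed stand-in for a small on-chip visited bitmap
    level[root] = 0
    visited[root] = True
    frontier = [root]
    depth = 0

    while frontier:
        next_frontier = []
        # Main traversal: each vertex's neighbours form one contiguous slice
        # of col_idx, so the edge reads are sequential.
        for u in frontier:
            for v in col_idx[row_ptr[u]:row_ptr[u + 1]]:
                if not visited[v]:
                    visited[v] = True
                    next_frontier.append(v)
        depth += 1
        # Deferred step: the scattered writes to level[] happen here,
        # outside the edge-scanning loop.
        for v in next_frontier:
            level[v] = depth
        frontier = next_frontier

    return level
```

For a five-vertex graph with edges 0→1, 0→2, 1→3, 2→3 and an isolated vertex 4, calling `bfs_deferred_levels([0, 2, 3, 4, 4, 4], [1, 2, 3, 3], 0)` returns `[0, 1, 1, 2, -1]`.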