OBFS: OpenCL Based BFS Optimizations on Software Programmable FPGAs

2019 International Conference on Field-Programmable Technology (ICFPT), Pub Date: 2019-12-01, DOI: 10.1109/ICFPT47387.2019.00056

Cheng Liu, Xinyu Chen, Bingsheng He, Xiaofei Liao, Ying Wang, Lei Zhang


Citations: 6

Abstract

Breadth First Search (BFS) is a key building block of graph processing, and considerable effort has been devoted to accelerating BFS on FPGAs for both performance and energy efficiency. Prior work typically built the BFS accelerator through handcrafted circuit design in a hardware description language (HDL). Despite its relatively good performance, HDL-based design leads to extremely low design productivity and incurs high porting and maintenance costs. While high-level synthesis (HLS) tools make it convenient to create a functionally correct BFS accelerator, the resulting performance can be much lower than that of a handcrafted HDL design. To obtain both near-handcrafted performance and software-like features such as portability and maintainability, we propose OBFS, an OpenCL-based BFS accelerator on software-programmable FPGAs. Observing that OpenCL-based FPGA designs are rather inefficient on irregular memory accesses, we propose approaches including data alignment, graph reordering, and batching to ensure coalesced memory accesses. In addition, we take advantage of the on-chip buffer to mitigate inefficient random DDR accesses. Finally, we shift the random level update in BFS out of the main processing pipeline and overlap it with the following BFS processing task. In our experiments on Intel Harp-v2, OBFS achieves average speedups of 9.5X and 5.5X over a vertex-centric implementation and an edge-centric implementation, respectively. Compared to prior handcrafted designs, it achieves comparable or even better performance.
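
The idea of moving the level update out of the main traversal loop can be illustrated with a small software sketch. This is not the OBFS kernel and is not taken from the paper: it is a plain sequential Python analogy, assuming a graph stored in CSR form, and the names `bfs_levels`, `row_ptr`, `col_idx`, and `pending` are hypothetical. Stage 1 scans each frontier vertex's neighbor range (a contiguous slice of `col_idx`) and only records candidates; stage 2 applies the scattered level writes as a separate batch. On an FPGA the second stage could be overlapped with later processing, which this sketch does not attempt to model.

```python
UNVISITED = -1


def bfs_levels(row_ptr, col_idx, root):
    """Level-synchronous BFS over a CSR graph (illustrative sketch only).

    row_ptr[v]..row_ptr[v+1] delimits the neighbors of v inside col_idx.
    Returns a list with the BFS level of every vertex, or UNVISITED.
    """
    num_vertices = len(row_ptr) - 1
    level = [UNVISITED] * num_vertices
    level[root] = 0
    frontier = [root]
    depth = 0

    while frontier:
        # Stage 1: traverse. Neighbor lists are read as contiguous ranges;
        # no level is written here, candidates are only collected.
        pending = []
        for v in frontier:
            for e in range(row_ptr[v], row_ptr[v + 1]):
                u = col_idx[e]
                if level[u] == UNVISITED:
                    pending.append(u)

        # Stage 2: apply the random-access level updates as one batch.
        # A vertex may appear several times in pending; only its first
        # occurrence is assigned a level and added to the next frontier.
        depth += 1
        next_frontier = []
        for u in pending:
            if level[u] == UNVISITED:
                level[u] = depth
                next_frontier.append(u)
        frontier = next_frontier

    return level


if __name__ == "__main__":
    # Undirected example graph: 0-1, 0-2, 1-3, 2-3, 3-4; vertex 5 is isolated.
    row_ptr = [0, 2, 4, 6, 9, 10, 10]
    col_idx = [1, 2, 0, 3, 0, 3, 1, 2, 4, 3]
    print(bfs_levels(row_ptr, col_idx, 0))
```

Keeping the scattered writes in their own stage is what makes it possible, in principle, to schedule them separately from the sequential neighbor reads; how OBFS actually does this in OpenCL is described in the paper itself, not here.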
