Hardware thread reordering to boost OpenCL throughput on FPGAs

2016 IEEE 34th International Conference on Computer Design (ICCD). Pub Date: 2016-10-01. DOI: 10.1109/ICCD.2016.7753288

Amir Momeni, H. Tabkhi, G. Schirner, D. Kaeli


Citations: 4

Abstract

The availability of OpenCL for FPGAs has raised new questions about the efficiency of massive thread-level parallelism on FPGAs. The general trend is toward deep pipelining and in-order execution of many OpenCL threads across a shared data-path. While this is very effective for regular kernels, its efficiency diminishes significantly for irregular kernels with runtime-dependent control flow. New approaches are needed to improve the execution efficiency of FPGAs when targeting irregular OpenCL kernels. This paper proposes a novel solution, called Hardware Thread Reordering (HTR), to boost the throughput of FPGAs when executing irregular kernels with non-deterministic runtime control flow. The key insight of HTR is out-of-order execution of OpenCL threads over a shared data-path to achieve significantly higher throughput. Thread reordering is performed at basic-block granularity. The synthesized basic blocks are extended with independent pipeline control signals and context registers to bypass the live values of reordered threads. We demonstrate the efficiency of the proposed solution on three parallel irregular kernels. For the experiments, we use the LegUp tool to compare the baseline (in-order) data-path with the HTR-enhanced data-path. Our RTL simulation results show that the HTR-enhanced data-path achieves up to an 11× increase in kernel throughput at low overhead (less than a 2× increase in FPGA resources).
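To make the in-order versus reordered contrast concrete, the following is a minimal toy model in Python. It is not the paper's microarchitecture or its LegUp-based flow. It assumes a single pipelined loop basic block of depth `depth`, threads whose loop trip counts are only known at run time, and a simplified in-order baseline in which threads enter in waves of `depth` and a wave leaves only when its slowest thread is done. The function names and the wave model are hypothetical, chosen only for illustration.

```python
import heapq


def in_order_cycles(trip_counts, depth):
    """Toy in-order baseline (an assumption, not the paper's design).

    Threads fill the pipeline in waves of `depth`. Each pass through the
    block costs `depth` cycles, and the whole wave keeps circulating until
    its slowest thread has finished, so fast threads become bubbles.
    """
    total = 0
    for start in range(0, len(trip_counts), depth):
        wave = trip_counts[start:start + depth]
        total += max(wave) * depth
    return total


def reordered_cycles(trip_counts, depth):
    """Toy reordering model (an assumption, not the paper's design).

    Each of the `depth` pipeline slots is refilled with the next waiting
    thread as soon as its current thread exits the loop, so threads may
    complete out of their issue order.
    """
    slots = [0] * depth  # cycle at which each slot becomes free
    heapq.heapify(slots)
    finish = 0
    for trips in trip_counts:
        free_at = heapq.heappop(slots)      # earliest available slot
        done = free_at + trips * depth      # thread occupies it for all passes
        finish = max(finish, done)
        heapq.heappush(slots, done)
    return finish


if __name__ == "__main__":
    irregular = [1, 1, 1, 8, 1, 1, 1, 8]   # hypothetical trip counts
    regular = [3] * 8
    for name, trips in (("irregular", irregular), ("regular", regular)):
        print(name, in_order_cycles(trips, 4), reordered_cycles(trips, 4))
```

Under these assumptions the irregular workload takes 64 cycles in order and 40 cycles with reordering, while the regular workload takes 24 cycles in both cases. This mirrors the abstract's qualitative claim that in-order pipelining works well for regular kernels and loses efficiency on irregular ones. The numbers are artifacts of the toy model and say nothing about the reported 11× result.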

