Optimizing Complex OpenCL Code for FPGA: A Case Study on Finite Automata Traversal

2020 IEEE 26th International Conference on Parallel and Distributed Systems (ICPADS). Pub Date: 2020-12-01. DOI: 10.1109/ICPADS51040.2020.00073

Marziyeh Nourian, Mostafa Eghbali Zarch, M. Becchi


Citations: 4

Abstract

While FPGAs have traditionally been considered hard to program, recent efforts aim to let developers program them with the high-level programming models and libraries intended for multi-core CPUs and GPUs. For example, both Intel and Xilinx now provide toolchains to deploy OpenCL code onto FPGAs. However, because the nature of the parallelism offered by GPU and FPGA devices is fundamentally different, OpenCL code optimized for GPU can prove very inefficient on FPGA, in terms of both performance and hardware resource utilization. This paper explores this problem for finite automata traversal. In particular, we consider an OpenCL NFA traversal kernel that is optimized for GPU but exhibits FPGA-friendly characteristics, namely limited memory requirements, lack of synchronization, and SIMD execution. We explore a set of structural code changes and custom and best-practice optimizations to retarget this code to FPGA. We showcase the effect of these optimizations on an Intel Stratix V FPGA board using various NFA topologies from different application domains. Our evaluation shows that, while the resource requirements of the original code exceed the capacity of the FPGA in use, our optimizations lead to significant resource savings and allow the transformed code to fit the FPGA for all considered NFA topologies. In addition, our optimizations lead to speedups of up to 4x over an already optimized code variant aimed at fitting the NFA traversal kernel on the FPGA. Some of the proposed optimizations can be generalized to other applications and introduced in OpenCL-to-FPGA compilers.
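For readers unfamiliar with the workload, NFA traversal can be illustrated with a small sequential sketch. This is a generic textbook-style illustration in Python, not the OpenCL kernel studied in the paper; the transition-table layout, the function name `traverse_nfa`, and the toy automaton are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical NFA representation (an assumption, not the paper's layout):
# transitions maps (state, symbol) -> set of next states.
Transitions = Dict[Tuple[int, str], Set[int]]


def traverse_nfa(transitions: Transitions, start: Set[int],
                 accepting: Set[int], text: str) -> List[int]:
    """Return the input positions at which some accepting state is active."""
    active = set(start)
    matches = []
    for pos, symbol in enumerate(text):
        # Start states are re-activated on every symbol so that a match
        # may begin at any position in the input stream.
        next_active = set(start)
        for state in active:
            next_active |= transitions.get((state, symbol), set())
        active = next_active
        if active & accepting:
            matches.append(pos)
    return matches


# Toy NFA that recognizes the pattern "ab" anywhere in the input.
toy = {(0, "a"): {1}, (1, "b"): {2}}
result = traverse_nfa(toy, start={0}, accepting={2}, text="xabab")
print(result)
```

Each input symbol updates the whole set of active states, and the work per active state is independent of the others. That independence is the kind of data parallelism a GPU or FPGA implementation could exploit, although how the paper's kernel does so is not described in the abstract.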

