Accelerating Kernel Ridge Regression with Conjugate Gradient Method for large-scale data using FPGA High-level Synthesis

Yousef Alnaser, Jan Langer, M. Stoll
{"title":"基于FPGA高级合成的大规模数据共轭梯度加速核脊回归","authors":"Yousef Alnaser, Jan Langer, M. Stoll","doi":"10.1109/H2RC56700.2022.00009","DOIUrl":null,"url":null,"abstract":"In this work, we accelerate the Kernel Ridge Regression algorithm on an FPGA-based adaptive computing platform to achieve higher performance within faster development time by employing a design approach using high-level synthesis (HLS). We partition the overall algorithm into a quadratic complexity part that runs on the FPGA fabric and a linear complexity part that runs in Python on the ARM processors. In order to avoid storing the potentially huge kernel matrix in external memory, the designed accelerator computes the matrix on-the-fly in each iteration. Moreover, we overcome the memory bandwidth limitation by partitioning the kernel matrix into smaller tiles that are pre-fetched to small local memories and reused multiple times. The design is also parallelized and fully pipelined. The final accelerator can be used for any large-scale data without kernel matrix storage limitations and with an arbitrary number of features. The accelerator reaches 86 GFLOPS on a Kria XCK26 FPGA and 231 GFLOPS on an Alveo U30 card which is within 72% and 90% of the estimated peak performance of those boards. We use the Pynq framework to directly call the accelerator from our Python code. This work is an important first step towards a library for accelerating different Kernel methods for Machine Learning applications for different FPGA platforms that can be used conveniently from Python with a NumPy-like interface.","PeriodicalId":102662,"journal":{"name":"2022 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Accelerating Kernel Ridge Regression with Conjugate Gradient Method for large-scale data using FPGA High-level Synthesis\",\"authors\":\"Yousef Alnaser, Jan Langer, M. Stoll\",\"doi\":\"10.1109/H2RC56700.2022.00009\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this work, we accelerate the Kernel Ridge Regression algorithm on an FPGA-based adaptive computing platform to achieve higher performance within faster development time by employing a design approach using high-level synthesis (HLS). We partition the overall algorithm into a quadratic complexity part that runs on the FPGA fabric and a linear complexity part that runs in Python on the ARM processors. In order to avoid storing the potentially huge kernel matrix in external memory, the designed accelerator computes the matrix on-the-fly in each iteration. Moreover, we overcome the memory bandwidth limitation by partitioning the kernel matrix into smaller tiles that are pre-fetched to small local memories and reused multiple times. The design is also parallelized and fully pipelined. The final accelerator can be used for any large-scale data without kernel matrix storage limitations and with an arbitrary number of features. The accelerator reaches 86 GFLOPS on a Kria XCK26 FPGA and 231 GFLOPS on an Alveo U30 card which is within 72% and 90% of the estimated peak performance of those boards. We use the Pynq framework to directly call the accelerator from our Python code. 
This work is an important first step towards a library for accelerating different Kernel methods for Machine Learning applications for different FPGA platforms that can be used conveniently from Python with a NumPy-like interface.\",\"PeriodicalId\":102662,\"journal\":{\"name\":\"2022 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC)\",\"volume\":\"10 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/H2RC56700.2022.00009\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/H2RC56700.2022.00009","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

In this work, we accelerate the Kernel Ridge Regression algorithm on an FPGA-based adaptive computing platform, using a high-level synthesis (HLS) design approach to reach high performance within a short development time. We partition the overall algorithm into a quadratic-complexity part that runs on the FPGA fabric and a linear-complexity part that runs in Python on the ARM processors. To avoid storing the potentially huge kernel matrix in external memory, the accelerator computes the matrix on the fly in each iteration. Moreover, we overcome the memory-bandwidth limitation by partitioning the kernel matrix into smaller tiles that are pre-fetched into small local memories and reused multiple times. The design is also parallelized and fully pipelined. The final accelerator can be used for any large-scale data set, without kernel-matrix storage limitations and with an arbitrary number of features. It reaches 86 GFLOPS on a Kria XCK26 FPGA and 231 GFLOPS on an Alveo U30 card, which is 72% and 90% of the estimated peak performance of those boards, respectively. We use the Pynq framework to call the accelerator directly from our Python code. This work is an important first step towards a library that accelerates different kernel methods for machine learning applications on different FPGA platforms and can be used conveniently from Python through a NumPy-like interface.
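To make the decomposition described in the abstract concrete, the following is a minimal NumPy sketch of a matrix-free conjugate gradient scheme for Kernel Ridge Regression: the kernel matrix is never stored, its tiles are computed on the fly, and only the quadratic-complexity matrix-vector product is the part that would be offloaded to the FPGA. The RBF kernel choice, the tile size, and all function names (rbf_tile, kernel_matvec, krr_cg) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def rbf_tile(Xi, Xj, gamma):
    """Compute one tile of the RBF kernel matrix on the fly,
    never materializing the full N x N matrix (illustrative kernel choice)."""
    # Pairwise squared Euclidean distances between the two sample blocks.
    sq = (np.sum(Xi**2, axis=1)[:, None]
          + np.sum(Xj**2, axis=1)[None, :]
          - 2.0 * Xi @ Xj.T)
    return np.exp(-gamma * sq)

def kernel_matvec(X, v, gamma, tile=256):
    """y = K v, computed tile by tile -- the memory-access pattern that
    mirrors pre-fetching kernel tiles into small local memories."""
    n = X.shape[0]
    y = np.zeros(n)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            y[i:i+tile] += rbf_tile(X[i:i+tile], X[j:j+tile], gamma) @ v[j:j+tile]
    return y

def krr_cg(X, b, lam, gamma, tol=1e-6, max_iter=100):
    """Solve (K + lam*I) alpha = b with conjugate gradient. The O(n^2)
    kernel matvec is the FPGA-sized step; the rest is O(n) host work."""
    alpha = np.zeros_like(b, dtype=float)
    r = b - (kernel_matvec(X, alpha, gamma) + lam * alpha)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = kernel_matvec(X, p, gamma) + lam * p   # quadratic-complexity step
        a = rs / (p @ Ap)
        alpha += a * p      # linear-complexity updates (host side)
        r -= a * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return alpha

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 8))   # 500 samples, arbitrary feature count
    y = np.sin(X[:, 0])
    alpha = krr_cg(X, y, lam=1e-3, gamma=0.5)
    print("residual:", np.linalg.norm(kernel_matvec(X, alpha, 0.5) + 1e-3 * alpha - y))
```

In the actual design, the role played here by kernel_matvec belongs to the HLS accelerator on the FPGA fabric, invoked from Python via Pynq, while the remaining O(n) vector updates of the CG iteration stay in Python on the ARM cores, matching the quadratic/linear split the abstract describes.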