{"title":"基于FPGA高级合成的大规模数据共轭梯度加速核脊回归","authors":"Yousef Alnaser, Jan Langer, M. Stoll","doi":"10.1109/H2RC56700.2022.00009","DOIUrl":null,"url":null,"abstract":"In this work, we accelerate the Kernel Ridge Regression algorithm on an FPGA-based adaptive computing platform to achieve higher performance within faster development time by employing a design approach using high-level synthesis (HLS). We partition the overall algorithm into a quadratic complexity part that runs on the FPGA fabric and a linear complexity part that runs in Python on the ARM processors. In order to avoid storing the potentially huge kernel matrix in external memory, the designed accelerator computes the matrix on-the-fly in each iteration. Moreover, we overcome the memory bandwidth limitation by partitioning the kernel matrix into smaller tiles that are pre-fetched to small local memories and reused multiple times. The design is also parallelized and fully pipelined. The final accelerator can be used for any large-scale data without kernel matrix storage limitations and with an arbitrary number of features. The accelerator reaches 86 GFLOPS on a Kria XCK26 FPGA and 231 GFLOPS on an Alveo U30 card which is within 72% and 90% of the estimated peak performance of those boards. We use the Pynq framework to directly call the accelerator from our Python code. This work is an important first step towards a library for accelerating different Kernel methods for Machine Learning applications for different FPGA platforms that can be used conveniently from Python with a NumPy-like interface.","PeriodicalId":102662,"journal":{"name":"2022 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Accelerating Kernel Ridge Regression with Conjugate Gradient Method for large-scale data using FPGA High-level Synthesis\",\"authors\":\"Yousef Alnaser, Jan Langer, M. Stoll\",\"doi\":\"10.1109/H2RC56700.2022.00009\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this work, we accelerate the Kernel Ridge Regression algorithm on an FPGA-based adaptive computing platform to achieve higher performance within faster development time by employing a design approach using high-level synthesis (HLS). We partition the overall algorithm into a quadratic complexity part that runs on the FPGA fabric and a linear complexity part that runs in Python on the ARM processors. In order to avoid storing the potentially huge kernel matrix in external memory, the designed accelerator computes the matrix on-the-fly in each iteration. Moreover, we overcome the memory bandwidth limitation by partitioning the kernel matrix into smaller tiles that are pre-fetched to small local memories and reused multiple times. The design is also parallelized and fully pipelined. The final accelerator can be used for any large-scale data without kernel matrix storage limitations and with an arbitrary number of features. The accelerator reaches 86 GFLOPS on a Kria XCK26 FPGA and 231 GFLOPS on an Alveo U30 card which is within 72% and 90% of the estimated peak performance of those boards. We use the Pynq framework to directly call the accelerator from our Python code. 
This work is an important first step towards a library for accelerating different Kernel methods for Machine Learning applications for different FPGA platforms that can be used conveniently from Python with a NumPy-like interface.\",\"PeriodicalId\":102662,\"journal\":{\"name\":\"2022 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC)\",\"volume\":\"10 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/H2RC56700.2022.00009\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/H2RC56700.2022.00009","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Accelerating Kernel Ridge Regression with Conjugate Gradient Method for large-scale data using FPGA High-level Synthesis
In this work, we accelerate the Kernel Ridge Regression algorithm on an FPGA-based adaptive computing platform, using a high-level synthesis (HLS) design approach to achieve higher performance with shorter development time. We partition the overall algorithm into a quadratic-complexity part that runs on the FPGA fabric and a linear-complexity part that runs in Python on the ARM processors. To avoid storing the potentially huge kernel matrix in external memory, the designed accelerator computes the matrix on the fly in each iteration. Moreover, we overcome the memory-bandwidth limitation by partitioning the kernel matrix into smaller tiles that are prefetched into small local memories and reused multiple times. The design is also parallelized and fully pipelined. The final accelerator can be applied to large-scale data with an arbitrary number of features and without kernel-matrix storage limitations. It reaches 86 GFLOPS on a Kria XCK26 FPGA and 231 GFLOPS on an Alveo U30 card, corresponding to 72% and 90% of the estimated peak performance of those boards, respectively. We use the PYNQ framework to call the accelerator directly from our Python code. This work is an important first step towards a library that accelerates different kernel methods for machine-learning applications on different FPGA platforms and can be used conveniently from Python through a NumPy-like interface.
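To make the described scheme concrete, here is a minimal host-side NumPy sketch of Kernel Ridge Regression solved with the Conjugate Gradient method, where the kernel matrix is never stored but computed tile by tile on the fly. The RBF kernel, the tile size, and all function names are illustrative assumptions, not details taken from the paper; the tiled matrix-vector product plays the role of the quadratic-complexity part that the paper offloads to the FPGA fabric, while the surrounding CG updates are the linear-complexity part kept in Python on the ARM cores.

```python
import numpy as np

def rbf_tile(X_i, X_j, gamma):
    # One tile of the kernel matrix, computed on the fly
    # (hypothetical host-side stand-in for the FPGA datapath).
    sq = ((X_i[:, None, :] - X_j[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def krr_cg(X, y, lam=1e-3, gamma=0.1, tile=256, tol=1e-6, max_iter=100):
    # Solve (K + lam*I) @ alpha = y without ever materializing the n x n K.
    n = X.shape[0]

    def matvec(v):
        # Quadratic-complexity part: (K + lam*I) @ v, tile by tile.
        # Each row block of X is reused across a full sweep of column tiles,
        # mirroring the prefetch-and-reuse strategy described above.
        out = lam * v
        for i in range(0, n, tile):
            for j in range(0, n, tile):
                K_ij = rbf_tile(X[i:i + tile], X[j:j + tile], gamma)
                out[i:i + tile] += K_ij @ v[j:j + tile]
        return out

    # Linear-complexity CG iteration (the part run in Python on the ARM cores).
    alpha = np.zeros(n)
    r = y - matvec(alpha)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        a = rs / (p @ Ap)
        alpha += a * p
        r -= a * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return alpha
```

In the actual design, `matvec` is what the HLS accelerator replaces: the host moves the vector into shared memory and starts the IP through PYNQ rather than looping in NumPy. A heavily hedged sketch of that invocation, using PYNQ's standard `Overlay`/`allocate` API but with a hypothetical bitstream name, IP name, and register layout, might look as follows:

```python
import numpy as np
from pynq import Overlay, allocate

# "krr_accel.bit" and "krr_matvec_0" are hypothetical names; the real
# bitstream and IP hierarchy depend on the actual Vivado/Vitis project.
ol = Overlay("krr_accel.bit")
ip = ol.krr_matvec_0

n = 4096
v = allocate(shape=(n,), dtype=np.float32)    # input vector in shared memory
out = allocate(shape=(n,), dtype=np.float32)  # result buffer

v[:] = np.float32(1.0)
v.sync_to_device()

# Register offsets below follow the usual HLS ap_ctrl convention but are
# purely illustrative; the real offsets come from the generated driver.
ip.write(0x10, v.physical_address)
ip.write(0x18, out.physical_address)
ip.write(0x00, 1)                   # ap_start
while (ip.read(0x00) & 0x2) == 0:   # poll ap_done
    pass
out.sync_from_device()
```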