基于Intel Stratix 10 fpga的Cannon矩阵乘法算法的OpenCL实现

Paolo Gorlani, Tobias Kenter, Christian Plessl
{"title":"基于Intel Stratix 10 fpga的Cannon矩阵乘法算法的OpenCL实现","authors":"Paolo Gorlani, Tobias Kenter, Christian Plessl","doi":"10.1109/ICFPT47387.2019.00020","DOIUrl":null,"url":null,"abstract":"Stratix 10 FPGA cards have a good potential for the acceleration of HPC workloads since the Stratix 10 product line introduces devices with a large number of DSP and memory blocks. The high level synthesis of OpenCL codes can play a fundamental role for FPGAs in HPC, because it allows to implement different designs with lower development effort compared to hand optimized HDL. However, Stratix 10 cards are still hard to fully exploit using the Intel FPGA SDK for OpenCL. The implementation of designs with thousands of concurrent arithmetic operations often suffers from place and route problems that limit the maximum frequency or entirely prevent a successful synthesis. In order to overcome these issues for the implementation of the matrix multiplication, we formulate Cannon's matrix multiplication algorithm with regard to its efficient synthesis within the FPGA logic. We obtain a two-level block algorithm, where the lower level sub-matrices are multiplied using our Cannon's algorithm implementation. Following this design approach with multiple compute units, we are able to get maximum frequencies close to and above 300 MHz with high utilization of DSP and memory blocks. This allows for performance results above 1 TeraFLOPS.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"134 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":"{\"title\":\"OpenCL Implementation of Cannon’s Matrix Multiplication Algorithm on Intel Stratix 10 FPGAs\",\"authors\":\"Paolo Gorlani, Tobias Kenter, Christian Plessl\",\"doi\":\"10.1109/ICFPT47387.2019.00020\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Stratix 10 FPGA cards have a good potential for the acceleration of HPC workloads since the Stratix 10 product line introduces devices with a large number of DSP and memory blocks. The high level synthesis of OpenCL codes can play a fundamental role for FPGAs in HPC, because it allows to implement different designs with lower development effort compared to hand optimized HDL. However, Stratix 10 cards are still hard to fully exploit using the Intel FPGA SDK for OpenCL. The implementation of designs with thousands of concurrent arithmetic operations often suffers from place and route problems that limit the maximum frequency or entirely prevent a successful synthesis. In order to overcome these issues for the implementation of the matrix multiplication, we formulate Cannon's matrix multiplication algorithm with regard to its efficient synthesis within the FPGA logic. We obtain a two-level block algorithm, where the lower level sub-matrices are multiplied using our Cannon's algorithm implementation. Following this design approach with multiple compute units, we are able to get maximum frequencies close to and above 300 MHz with high utilization of DSP and memory blocks. This allows for performance results above 1 TeraFLOPS.\",\"PeriodicalId\":241340,\"journal\":{\"name\":\"2019 International Conference on Field-Programmable Technology (ICFPT)\",\"volume\":\"134 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"12\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 International Conference on Field-Programmable Technology (ICFPT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICFPT47387.2019.00020\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 International Conference on Field-Programmable Technology (ICFPT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICFPT47387.2019.00020","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 12

摘要

Stratix 10 FPGA卡在加速HPC工作负载方面具有良好的潜力,因为Stratix 10产品线引入了具有大量DSP和内存块的设备。OpenCL代码的高级合成可以在高性能计算中的fpga中发挥基本作用,因为与手工优化的HDL相比,它允许以更低的开发工作量实现不同的设计。然而,Stratix 10卡仍然很难完全利用OpenCL的英特尔FPGA SDK。具有数千个并发算术运算的设计的实现经常受到位置和路由问题的困扰,这些问题限制了最大频率或完全阻止了成功的综合。为了克服矩阵乘法实现的这些问题,我们制定了Cannon的矩阵乘法算法,考虑到它在FPGA逻辑中的有效合成。我们获得了一个两级块算法,其中使用我们的Cannon算法实现将较低级别的子矩阵相乘。采用这种多计算单元的设计方法,我们能够在DSP和内存块的高利用率下获得接近或高于300 MHz的最大频率。这允许性能结果超过1 TeraFLOPS。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
OpenCL Implementation of Cannon’s Matrix Multiplication Algorithm on Intel Stratix 10 FPGAs
Stratix 10 FPGA cards have a good potential for the acceleration of HPC workloads since the Stratix 10 product line introduces devices with a large number of DSP and memory blocks. The high level synthesis of OpenCL codes can play a fundamental role for FPGAs in HPC, because it allows to implement different designs with lower development effort compared to hand optimized HDL. However, Stratix 10 cards are still hard to fully exploit using the Intel FPGA SDK for OpenCL. The implementation of designs with thousands of concurrent arithmetic operations often suffers from place and route problems that limit the maximum frequency or entirely prevent a successful synthesis. In order to overcome these issues for the implementation of the matrix multiplication, we formulate Cannon's matrix multiplication algorithm with regard to its efficient synthesis within the FPGA logic. We obtain a two-level block algorithm, where the lower level sub-matrices are multiplied using our Cannon's algorithm implementation. Following this design approach with multiple compute units, we are able to get maximum frequencies close to and above 300 MHz with high utilization of DSP and memory blocks. This allows for performance results above 1 TeraFLOPS.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信