A Highly Efficient Layout-Aware FPGA Overlay Accelerator Mapping Method

Tanvir Ahmed, Johannes Maximilian Kühn, Ken Namura
Published in: 2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)
Publication date: 2021-12-01
DOI: 10.1109/MCSoC51149.2021.00046
URL: https://doi.org/10.1109/MCSoC51149.2021.00046
Citations: 0

Abstract

FPGAs are gaining traction as a platform for accelerating applications that require both high performance and specialization. However, exploiting the maximum compute potential of FPGAs remains a critical and time-consuming task that usually requires expert knowledge. Typically, designers seek to maximize the usage of hardened arithmetic blocks (DSPs, such as the DSP48 in Xilinx devices), but because their number is limited, the critical path quickly increases when portions of the design are mapped to lookup tables (LUTs). To mitigate the DSP limitation and to maximize FPGA utilization, we propose combining FPGA overlay accelerators with a mapping method that efficiently exploits the FPGA's layout information and its resources. This mapping method relies on a two-step process: (1) extraction of the architectural and layout information of the FPGA, and (2) optimized placement of the processing elements (PEs) of the accelerator onto the FPGA resources. The placement step maps the PEs to DSPs and LUTs so as to reduce the critical path among PEs. We applied our method to implement a systolic array, a multiplier array, and a coarse-grained reconfigurable architecture (CGRA) on a Xilinx FPGA. The proposed method achieves a performance and energy-efficiency increase of more than 14x over the vendor tool mapping, while also increasing FPGA utilization by more than 1.5x compared to DSP-limited mappings.
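
The two-step process described in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: the device model (a string of column types), the function names `extract_layout`, `place_chain`, and `longest_hop`, and the serpentine ordering are all assumptions made for illustration, and Manhattan distance between neighboring PEs is used only as a rough stand-in for the critical path.

```python
from typing import Dict, List, Tuple

# A site is (x, y, kind) with kind "DSP" or "LUT". Hypothetical toy model.
Site = Tuple[int, int, str]


def extract_layout(columns: str, rows: int) -> List[Site]:
    """Step 1 (toy): expand a column description such as "LDLLD" into sites.

    "D" marks a DSP column and "L" a LUT column; every column has `rows` sites.
    """
    kinds = {"D": "DSP", "L": "LUT"}
    return [(x, y, kinds[c]) for x, c in enumerate(columns) for y in range(rows)]


def place_chain(num_pes: int, sites: List[Site]) -> Dict[int, Site]:
    """Step 2 (toy): place a 1-D chain of PEs along a serpentine path.

    Even columns are walked bottom-up and odd columns top-down, so consecutive
    PEs land on physically adjacent sites whenever the columns are contiguous.
    """
    ordered = sorted(sites, key=lambda s: (s[0], s[1] if s[0] % 2 == 0 else -s[1]))
    if num_pes > len(ordered):
        raise ValueError("more PEs than available sites")
    return {pe: ordered[pe] for pe in range(num_pes)}


def longest_hop(placement: Dict[int, Site]) -> int:
    """Largest Manhattan distance between consecutive PEs (critical-path proxy)."""
    hops = [
        abs(placement[i][0] - placement[i + 1][0])
        + abs(placement[i][1] - placement[i + 1][1])
        for i in range(len(placement) - 1)
    ]
    return max(hops, default=0)


if __name__ == "__main__":
    sites = extract_layout("LDLLD", rows=4)           # 20 sites, 8 of them DSP
    dsp_only = [s for s in sites if s[2] == "DSP"]

    mixed = place_chain(12, sites)                    # uses DSP and LUT sites
    limited = place_chain(8, dsp_only)                # capped by the DSP count

    print("mixed:    PEs =", len(mixed), "longest hop =", longest_hop(mixed))
    print("DSP-only: PEs =", len(limited), "longest hop =", longest_hop(limited))
```

In this toy device, the DSP-only mapping is capped at 8 PEs and has to jump between distant DSP columns, whereas the mixed mapping fits more PEs and keeps every neighbor pair adjacent. A real flow would take the layout from the vendor tool's device description and would weigh the different delays of DSP-based and LUT-based PEs, which this sketch ignores.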