RxRE: Throughput Optimization for High-Level Synthesis using Resource-Aware Regularity Extraction (Abstract Only)

A. Lotfi, Rajesh K. Gupta
Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, published 2017-02-22
DOI: 10.1145/3020078.3021797
Citations: 0

Abstract

Despite considerable improvements in the quality of HLS tools, they still require the designer's manual optimizations and tweaks to generate efficient results, which negates the productivity gains of HLS design. The majority of designer interventions lead to optimizations that are often global in nature, for instance, finding patterns in functions that better fit a custom-designed solution. We introduce a high-level, resource-aware regularity extraction workflow, called RxRE, that detects a class of patterns in an input program and enhances resource sharing to balance resource usage against increased throughput. RxRE automatically detects structural patterns, or repeated sequences of floating-point operations, in sequential loops, selects suitable resources for them, and shares those resources among all instances of the selected patterns. RxRE thus reduces the hardware area required to synthesize one instance of the program, so more program replicas fit within the fixed area budget of an FPGA. RxRE contributes to a pre-synthesis workflow that exploits the inherent regularity of applications to achieve higher computational throughput using off-the-shelf HLS tools, without any changes to the HLS flow. It uses a string-based pattern detection approach to find linear patterns across loops within the same function, and it deploys a simple but effective model to estimate the resource utilization and latency of each candidate design, which avoids synthesizing every possible design alternative. We have implemented and evaluated RxRE on a set of C benchmarks. Synthesis results on a Xilinx Virtex FPGA show that the reduced area of the transformed programs increases the number of mapped kernels by 1.54X on average (maximum 2.8X), which yields on average 1.59X (maximum 2.4X) higher throughput than the Xilinx Vivado HLS solution. The current implementation has several limitations and extracts only a special case of regularity, which is the subject of ongoing optimization and study.
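The abstract describes the workflow only at a high level. The Python sketch below illustrates the general idea under stated assumptions, not the authors' actual algorithm or cost model. Each loop body is linearized into a string of op codes, a brute-force search finds the longest op sequence shared by at least two loops, and a toy cost model estimates how mapping that pattern onto one shared datapath changes per-kernel area, latency, and the number of replicas that fit in a fixed area budget. The op-code encoding ("A" for floating-point add, "M" for floating-point multiply), the functions `shared_pattern`, `area`, `latency`, and `throughput`, and all cost constants are hypothetical.

```python
"""Toy sketch of resource-aware pattern sharing (illustrative assumptions only)."""

# Hypothetical encoding: each loop body is linearized into a string of op codes,
# "A" = floating-point add, "M" = floating-point multiply.
# Hypothetical per-operator costs; NOT taken from the paper or from Vivado HLS.
OP_AREA = {"A": 2, "M": 3}      # abstract area units per operator instance
OP_LATENCY = {"A": 4, "M": 3}   # cycles per operation
SHARE_OVERHEAD = 1              # assumed extra cycle per loop that uses the shared datapath


def shared_pattern(loops, min_len=2):
    """Return the longest op substring that occurs in at least two loop bodies."""
    best = ""
    for body in loops:
        for i in range(len(body)):
            for j in range(i + min_len, len(body) + 1):
                cand = body[i:j]
                hits = sum(cand in other for other in loops)  # loops containing cand
                if hits >= 2 and len(cand) > len(best):
                    best = cand
    return best


def area(loops, pattern=""):
    """One operator per op, except that the first occurrence of `pattern` in each
    loop is mapped onto a single datapath instantiated once for the whole kernel."""
    total = 0
    for body in loops:
        rest = body.replace(pattern, "", 1) if pattern else body
        total += sum(OP_AREA[op] for op in rest)
    if pattern:
        total += sum(OP_AREA[op] for op in pattern)
    return total


def latency(loops, pattern=""):
    """Sequential loops with chained ops, plus a sharing penalty per using loop."""
    total = sum(OP_LATENCY[op] for body in loops for op in body)
    if pattern:
        total += SHARE_OVERHEAD * sum(pattern in body for body in loops)
    return total


def throughput(loops, budget, pattern=""):
    """Replicate the kernel as often as the area budget allows."""
    replicas = budget // area(loops, pattern)
    return replicas, replicas / latency(loops, pattern)


if __name__ == "__main__":
    loops = ["MAMA", "MAMAA", "AMAMA"]
    p = shared_pattern(loops)
    for label, pat in (("baseline", ""), ("shared", p)):
        n, tp = throughput(loops, budget=100, pattern=pat)
        print(f"{label}: area={area(loops, pat)} latency={latency(loops, pat)} "
              f"replicas={n} throughput={tp:.3f}")
```

With these made-up costs, the search selects the pattern "MAMA"; sharing it shrinks the kernel from 34 to 14 area units at a cost of 3 extra cycles (50 to 53), so 7 replicas instead of 2 fit in a budget of 100 and the estimated throughput rises from 0.040 to about 0.132 kernel completions per cycle. The numbers only illustrate the area-versus-latency trade-off the abstract describes; they are not the paper's results.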