Deferring accelerator offloading decisions to application runtime

G. Vaz, Heinrich Riebler, Tobias Kenter, Christian Plessl
DOI: 10.1109/ReConFig.2014.7032509
Published in: 2014 International Conference on ReConFigurable Computing and FPGAs (ReConFig14), December 2014
Citations: 4

Abstract

Reconfigurable architectures provide an opportunity to accelerate a wide range of applications, frequently by exploiting data-parallelism, where the same operations are homogeneously executed on a (large) set of data. However, when the sequential code is executed on a host CPU and only data-parallel loops are executed on an FPGA coprocessor, a sufficiently large number of loop iterations (trip counts) is required, such that the control- and data-transfer overheads to the coprocessor can be amortized. However, the trip count of large data-parallel loops is frequently not known at compile time, but only at runtime just before entering a loop. Therefore, we propose to generate code both for the CPU and the coprocessor, and to defer the decision where to execute the appropriate code to the runtime of the application when the trip count of the loop can be determined just at runtime. We demonstrate how an LLVM compiler based toolflow can automatically insert appropriate decision blocks into the application code. Analyzing popular benchmark suites, we show that this kind of runtime decisions is often applicable. The practical feasibility of our approach is demonstrated by a toolflow that automatically identifies loops suitable for vectorization and generates code for the FPGA coprocessor of a Convey HC-1. The toolflow adds decisions based on a comparison of the runtime-computed trip counts to thresholds for specific loops and also includes support to move just the required data to the coprocessor. We evaluate the integrated toolflow with characteristic loops executed on different input data sizes.
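The core idea — compile both a CPU version and a coprocessor version of a loop, then branch on the runtime-computed trip count — can be illustrated with a minimal sketch. This is not the paper's actual generated code: the threshold value, the function names, and the placeholder coprocessor path are all assumptions for illustration; on a real Convey HC-1 the coprocessor branch would move the required data and launch the FPGA kernel.

```c
#include <stddef.h>

/* Hypothetical threshold: the minimum trip count for which offloading
 * amortizes the control- and data-transfer overheads (assumed value;
 * the paper derives per-loop thresholds from measurements). */
#define OFFLOAD_THRESHOLD 4096

/* CPU fallback: the original scalar loop. */
static void saxpy_cpu(float a, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Stand-in for the vectorized coprocessor version: on real hardware
 * this would copy just the required data and invoke the FPGA kernel. */
static void saxpy_coprocessor(float a, const float *x, float *y, size_t n) {
    /* ... data movement + kernel launch would go here ... */
    saxpy_cpu(a, x, y, n); /* placeholder so the sketch runs anywhere */
}

/* Decision block inserted by the toolflow just before the loop:
 * the trip count n is only known here, at runtime, so the execution
 * target is chosen here rather than at compile time. */
void saxpy(float a, const float *x, float *y, size_t n) {
    if (n >= OFFLOAD_THRESHOLD)
        saxpy_coprocessor(a, x, y, n);
    else
        saxpy_cpu(a, x, y, n);
}
```

Both versions compute the same result, so the decision block only affects where the loop runs, never what it computes — which is what makes inserting it automatically (here, by an LLVM-based toolflow) safe.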