B. Silva, An Braeken, E. D'Hollander, A. Touhafi, Jan G. Cornelis, J. Lemeire
FPGA: ACM International Symposium on Field-Programmable Gate Arrays, p. 274. Published 2013-02-11. DOI: 10.1145/2435264.2435336 (https://doi.org/10.1145/2435264.2435336)
Citations: 6
Abstract
Performance and toolchain of a combined GPU/FPGA desktop (abstract only)
Low-power, high-performance computing now relies on accelerator cards to speed up calculations. Combining the power of GPUs with the flexibility of FPGAs enlarges the scope of problems that can be accelerated [2, 3]. We describe the performance analysis of a desktop equipped with a Tesla 2050 GPU and a Virtex-6 LX240T FPGA. First, the balance between I/O and raw peak performance is depicted using the roofline model [4]. Next, the performance of a number of image processing algorithms is measured and the results are mapped onto the roofline graph. This makes it possible to compare the GPU and the FPGA and to optimize the algorithms for both accelerators. A programming toolchain is implemented, consisting of OpenCL for the GPU and several High-Level Synthesis (HLS) compilers for the FPGA. Our results show that the HLS compilers outperform handwritten code and offer performance comparable to the GPU. In addition, the FPGA compilers reduce development time by an order of magnitude, at the expense of increased resource consumption. The roofline model also shows that both accelerators are equally limited by the input/output bandwidth to the host. A well-tuned accelerator-based codesign that identifies the parallelism, computation, and data patterns of different classes of algorithms will make it possible to maximize the performance of the combined GPU/FPGA system [1].
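The roofline bound mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard roofline formulation, in which attainable performance is the smaller of the compute peak and the product of bandwidth and operational intensity. The function names and the peak and bandwidth figures below are hypothetical placeholders chosen for illustration; they are not measurements from the paper.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline bound in GFLOP/s.

    intensity is the operational intensity in FLOP per byte moved
    over the link whose bandwidth is given in GB/s.
    """
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Operational intensity (FLOP/byte) at which a kernel stops being
    bandwidth-bound and becomes compute-bound."""
    return peak_gflops / bandwidth_gbs


# Hypothetical placeholder values, NOT taken from the paper.
PEAK_GFLOPS = 1000.0   # assumed accelerator compute peak
HOST_BW_GBS = 8.0      # assumed host-to-accelerator bandwidth

if __name__ == "__main__":
    print("ridge point:", ridge_point(PEAK_GFLOPS, HOST_BW_GBS), "FLOP/byte")
    for intensity in (0.5, 4.0, 32.0, 256.0):
        bound = attainable_gflops(PEAK_GFLOPS, HOST_BW_GBS, intensity)
        print(f"intensity {intensity:6.1f} FLOP/byte -> bound {bound:7.1f} GFLOP/s")
```

With these placeholder numbers, any kernel whose operational intensity falls below the ridge point is capped by the host link rather than by the accelerator's compute peak. That is the kind of limit the abstract reports when it says both accelerators are equally limited by the input/output bandwidth to the host.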