Efficient compilation of CUDA kernels for high-performance computing on FPGAs

Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton, Deming Chen, J. Cong, Wen-mei W. Hwu
{"title":"Efficient compilation of CUDA kernels for high-performance computing on FPGAs","authors":"Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton, Deming Chen, J. Cong, Wen-mei W. Hwu","doi":"10.1145/2514641.2514652","DOIUrl":null,"url":null,"abstract":"The rise of multicore architectures across all computing domains has opened the door to heterogeneous multiprocessors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs, in particular, are becoming very popular for speeding up compute-intensive kernels of scientific, imaging, and simulation applications. New programming models that facilitate parallel processing on heterogeneous systems containing GPUs are spreading rapidly in the computing community. By leveraging these investments, the developers of other accelerators have an opportunity to significantly reduce the programming effort by supporting those accelerator models already gaining popularity. In this work, we adapt one such language, the CUDA programming model, into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool (available from Xilinx) which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SIMT (Single Instruction, Multiple Thread) CUDA code into task-level parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multicore accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.","PeriodicalId":183677,"journal":{"name":"ACM Trans. Embed. Comput. Syst.","volume":"96 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"26","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Trans. Embed. Comput. Syst.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2514641.2514652","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 26

Abstract

The rise of multicore architectures across all computing domains has opened the door to heterogeneous multiprocessors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs, in particular, are becoming very popular for speeding up compute-intensive kernels of scientific, imaging, and simulation applications. New programming models that facilitate parallel processing on heterogeneous systems containing GPUs are spreading rapidly in the computing community. By leveraging these investments, the developers of other accelerators have an opportunity to significantly reduce the programming effort by supporting those accelerator models already gaining popularity. In this work, we adapt one such language, the CUDA programming model, into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool (available from Xilinx) which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SIMT (Single Instruction, Multiple Thread) CUDA code into task-level parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multicore accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.
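To make the source-to-source idea concrete, the sketch below (not from the paper; the function names and the HLS pragma are illustrative assumptions) contrasts a trivial SIMT CUDA kernel with the kind of task-level parallel C that a thread-loop serialization pass might emit for high-level synthesis: each thread block becomes an independent C task, and the implicit per-thread parallelism becomes an explicit loop over thread indices that the HLS tool can unroll or pipeline.

```cuda
// Original SIMT CUDA kernel: one thread scales one array element.
__global__ void scale(float *out, const float *in, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * in[i];
}

// Illustrative result of an FCUDA-style thread-loop transformation
// (names and the pragma are assumptions, not the tool's actual output):
// one thread block becomes a C task, and the threads of that block
// become an explicit loop the HLS tool can unroll or pipeline.
void scale_block(float *out, const float *in, float a, int n,
                 int blockIdx_x, int blockDim_x) {
    for (int threadIdx_x = 0; threadIdx_x < blockDim_x; ++threadIdx_x) {
#pragma HLS PIPELINE  // hypothetical HLS directive; AutoPilot accepted similar pragmas
        int i = blockIdx_x * blockDim_x + threadIdx_x;
        if (i < n)
            out[i] = a * in[i];
    }
}
```

In a complete flow, many such block-level tasks can be instantiated as parallel cores on the reconfigurable fabric, which is how the coarse-grained (block-level) and fine-grained (thread-level) parallelism mentioned in the abstract is exploited.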