High-level synthesis of multiple dependent CUDA kernels on FPGA

2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC). Pub Date: 2013-04-29. DOI: 10.1109/ASPDAC.2013.6509613

S. Gurumani, Hisham Cholakkal, Yun Liang, K. Rupnow, Deming Chen


Citations: 19

Abstract

High-level synthesis (HLS) tools automatically generate hardware at the register transfer level (RTL) from algorithm descriptions written in high-level languages, enabling faster creation of custom accelerators for FPGA architectures. Existing HLS tools support a wide variety of input languages and assist users in design space exploration through automation and feedback on designs' performance bottlenecks. This design space exploration applies techniques such as pipelining, partitioning, and resource sharing to improve performance and resource utilization. However, although automated exploration can find some inherent parallelism, data-parallel input source code is still superior for exposing a greater variety of parallelism. In prior work, we demonstrated automated design space exploration of GPU multi-threaded (CUDA) language source code for efficient RTL generation. In this paper, we examine the challenges in extending this automated design space exploration to multiple dependent CUDA kernels, present a step-by-step procedure for efficiently performing multi-kernel synthesis, and show the potential of this approach through a case study of a stereo matching algorithm. This study demonstrates that HLS of multiple dependent CUDA kernels can maintain performance parity with the GPU implementation while consuming over 16X less energy than the GPU. Based on our manual procedure, we identify the key challenges in fully automating the synthesis of multi-kernel CUDA programs.

