Extending High-Level Synthesis for Task-Parallel Programs.

Proceedings ... Annual IEEE Symposium on Field-Programmable Custom Computing Machines. FCCM (Symposium) Pub Date : 2021-05-01 Epub Date: 2021-06-02 DOI:10.1109/fccm51124.2021.00032

Yuze Chi, Licheng Guo, Jason Lau, Young-Kyu Choi, Jie Wang, Jason Cong


Citations: 32

Abstract



Extending High-Level Synthesis for Task-Parallel Programs.

C/C++/OpenCL-based high-level synthesis (HLS) has become increasingly popular for field-programmable gate array (FPGA) accelerators in many application domains in recent years, thanks to its competitive quality of results (QoR) and short development cycles compared with the traditional register-transfer-level design approach. Yet, limited by the sequential C semantics, it remains challenging to adopt the same highly productive high-level programming approach in many other application domains, where coarse-grained tasks run in parallel and communicate with each other at a fine-grained level. While current HLS tools do support task-parallel programs, productivity is greatly limited ① in the code development cycle due to poor programmability, ② in the correctness verification cycle due to restricted software simulation, and ③ in the QoR tuning cycle due to slow code generation. Such limited productivity often defeats the purpose of HLS and hinders programmers from adopting HLS for task-parallel FPGA accelerators. In this paper, we extend the HLS C++ language and present a fully automated framework with programmer-friendly interfaces, unconstrained software simulation, and fast hierarchical code generation to overcome these limitations and demonstrate how task-parallel programs can be productively supported in HLS. Experimental results based on a wide range of real-world task-parallel programs show that, on average, the lines of kernel and host code are reduced by 22% and 51%, respectively, which considerably improves programmability. The correctness verification and the iterative QoR tuning cycles are shortened by 3.2× and 6.8×, respectively. Our work is open-source at https://github.com/UCLA-VAST/tapa/.
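
To make the phrase "coarse-grained tasks run in parallel and communicate with each other at a fine-grained level" concrete, the following is a minimal software sketch of that programming model, written in Python rather than HLS C++. It is not the framework's API and is not taken from the paper: the function names, the `EOT` end-of-stream marker, and the FIFO depth are all hypothetical choices made for illustration. Each task runs in its own thread, and tokens pass one at a time through bounded FIFOs.

```python
import queue
import threading

# Hypothetical end-of-stream marker; not part of any HLS tool's API.
EOT = object()


def producer(out_q, n):
    """Task 1: emit the integers 0..n-1 one token at a time."""
    for i in range(n):
        out_q.put(i)  # blocks while the FIFO is full
    out_q.put(EOT)


def scale(in_q, out_q, factor):
    """Task 2: multiply each token by `factor` as soon as it arrives."""
    while True:
        token = in_q.get()  # blocks while the FIFO is empty
        if token is EOT:
            out_q.put(EOT)
            return
        out_q.put(token * factor)


def consumer(in_q, result):
    """Task 3: collect the tokens into `result` in arrival order."""
    while True:
        token = in_q.get()
        if token is EOT:
            return
        result.append(token)


def run_pipeline(n, factor, depth=2):
    """Run the three tasks concurrently, connected by bounded FIFOs."""
    a = queue.Queue(maxsize=depth)
    b = queue.Queue(maxsize=depth)
    result = []
    tasks = [
        threading.Thread(target=producer, args=(a, n)),
        threading.Thread(target=scale, args=(a, b, factor)),
        threading.Thread(target=consumer, args=(b, result)),
    ]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return result


if __name__ == "__main__":
    print(run_pipeline(5, 3))
```

The small FIFO depth forces the tasks to interleave instead of running one after another, which is the behavior that a sequential simulation of the same three functions could not reproduce.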
