Systematically extending a high-level code generator with support for tensor cores

Proceedings of the 14th Workshop on General Purpose Processing Using GPU. Pub Date: 2022-04-03. DOI: 10.1145/3530390.3532733

Lukas Siefke, Bastian Köpcke, S. Gorlatch, Michel Steuwer


Citations: 1

Abstract

High-level code generators like Halide, Lift, and RISE make a compelling proposition: write programs in a simple high-level language and get high-performing GPU code "for free". They achieve this feat by restricting the input language to a specific domain (such as image and array processing in Halide) or to a fixed set of flexible parallel patterns (as Lift and RISE do). Implementing high-level code generators that produce high-performance code is challenging, especially as the target hardware constantly evolves. In this paper, we discuss how we systematically extend the RISE high-level code generator with support for tensor cores, a specialized hardware feature of recent Nvidia GPUs. We highlight the design of RISE that makes it easily extensible by following a systematic bottom-up approach: first, the imperative tensor core API is exposed to the code generator; then, the abstractions are raised to an internal low-level functional representation; finally, this representation is targeted by a rewrite process that starts from a high-level functional program. Our experimental evaluation shows that RISE with support for tensor cores generates code that is competitive in performance with manually optimized CUDA code, which is at most 36%, and on average only 10%, slower than Nvidia's highly optimized cuBLAS library, and clearly outperforms any code that does not exploit tensor cores.
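To make the first step of the bottom-up approach more concrete: imperative tensor core interfaces typically work on small matrix tiles ("fragments") that are loaded, combined with a multiply-accumulate operation, and stored back. The sketch below emulates that pattern in plain Python. It is only an illustration under stated assumptions: the names `load_fragment`, `mma`, `store_fragment`, and `tiled_matmul`, and the tile size of 4, are hypothetical and are neither RISE code nor a real GPU API.

```python
# Pure-Python emulation of the load / multiply-accumulate / store pattern
# used by tile-based tensor core interfaces. Every name here is hypothetical;
# this is not RISE code and not a real GPU API.

TILE = 4  # illustrative tile size (assumption); hardware fragments are larger


def load_fragment(matrix, row, col):
    """Copy the TILE x TILE sub-matrix starting at (row, col) into a fragment."""
    return [matrix[row + i][col:col + TILE] for i in range(TILE)]


def mma(a_frag, b_frag, acc_frag):
    """Return acc_frag + a_frag * b_frag (matrix product) for TILE x TILE fragments."""
    return [
        [
            acc_frag[i][j] + sum(a_frag[i][k] * b_frag[k][j] for k in range(TILE))
            for j in range(TILE)
        ]
        for i in range(TILE)
    ]


def store_fragment(matrix, row, col, frag):
    """Write a fragment back into the TILE x TILE region starting at (row, col)."""
    for i in range(TILE):
        matrix[row + i][col:col + TILE] = frag[i]


def tiled_matmul(a, b):
    """Compute a * b tile by tile; all dimensions must be multiples of TILE."""
    m, k, n = len(a), len(b), len(b[0])
    assert len(a[0]) == k, "inner dimensions must match"
    assert m % TILE == 0 and k % TILE == 0 and n % TILE == 0
    c = [[0] * n for _ in range(m)]
    for row in range(0, m, TILE):
        for col in range(0, n, TILE):
            # One accumulator fragment per output tile, initialised to zero.
            acc = [[0] * TILE for _ in range(TILE)]
            for kk in range(0, k, TILE):
                a_frag = load_fragment(a, row, kk)
                b_frag = load_fragment(b, kk, col)
                acc = mma(a_frag, b_frag, acc)
            store_fragment(c, row, col, acc)
    return c


def naive_matmul(a, b):
    """Reference triple-loop matrix product used to check the tiled version."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]
```

Comparing `tiled_matmul(a, b)` against `naive_matmul(a, b)` on small integer matrices is a simple way to check the tiling logic. On a GPU, each output tile would be handled by a group of cooperating threads rather than by a sequential loop.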
