系统地扩展了一个支持张量内核的高级代码生成器

Proceedings of the 14th Workshop on General Purpose Processing Using GPU Pub Date : 2022-04-03 DOI:10.1145/3530390.3532733

Lukas Siefke, Bastian Köpcke, S. Gorlatch, Michel Steuwer

{"title":"系统地扩展了一个支持张量内核的高级代码生成器","authors":"Lukas Siefke, Bastian Köpcke, S. Gorlatch, Michel Steuwer","doi":"10.1145/3530390.3532733","DOIUrl":null,"url":null,"abstract":"High-level code generators like Halide, Lift, and RISE make a compelling proposition: write programs in a simple high-level language and get high-performing GPU code \"for free\". They achieve this feat by restricting the input language to a specific domain (such as image and array processing in Halide) or to a fixed set of flexible parallel patterns (as Lift and RISE do). Implementing high-level code generators that produce high-performance code is challenging, specifically as the target hardware constantly evolves. In this paper, we discuss how we systematically extend the RISE high-level code generator with support for tensor cores, a specialized hardware feature of recent Nvidia GPUs. We highlight the design of RISE that makes it easily extensible by following a systematic bottom-up approach, that first, exposes the imperative tensor core API to the code generator, then, raises the abstractions to an internal low-level functional representation, that, finally, is targeted by a rewrite process that starts from a high-level functional program. Our experimental evaluation shows that RISE with support for tensor cores generates code of competitive performance to manually optimized CUDA code, which is only up to 36%, but on average only 10%, slower than Nvidia's highly optimized cuBLAS library, and clearly outperforms any code that does not exploit tensor cores.","PeriodicalId":442986,"journal":{"name":"Proceedings of the 14th Workshop on General Purpose Processing Using GPU","volume":"365 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Systematically extending a high-level code generator with support for tensor cores\",\"authors\":\"Lukas Siefke, Bastian Köpcke, S. Gorlatch, Michel Steuwer\",\"doi\":\"10.1145/3530390.3532733\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"High-level code generators like Halide, Lift, and RISE make a compelling proposition: write programs in a simple high-level language and get high-performing GPU code \\\"for free\\\". They achieve this feat by restricting the input language to a specific domain (such as image and array processing in Halide) or to a fixed set of flexible parallel patterns (as Lift and RISE do). Implementing high-level code generators that produce high-performance code is challenging, specifically as the target hardware constantly evolves. In this paper, we discuss how we systematically extend the RISE high-level code generator with support for tensor cores, a specialized hardware feature of recent Nvidia GPUs. We highlight the design of RISE that makes it easily extensible by following a systematic bottom-up approach, that first, exposes the imperative tensor core API to the code generator, then, raises the abstractions to an internal low-level functional representation, that, finally, is targeted by a rewrite process that starts from a high-level functional program. Our experimental evaluation shows that RISE with support for tensor cores generates code of competitive performance to manually optimized CUDA code, which is only up to 36%, but on average only 10%, slower than Nvidia's highly optimized cuBLAS library, and clearly outperforms any code that does not exploit tensor cores.\",\"PeriodicalId\":442986,\"journal\":{\"name\":\"Proceedings of the 14th Workshop on General Purpose Processing Using GPU\",\"volume\":\"365 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-04-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 14th Workshop on General Purpose Processing Using GPU\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3530390.3532733\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 14th Workshop on General Purpose Processing Using GPU","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3530390.3532733","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

像Halide、Lift和RISE这样的高级代码生成器提出了一个引人注目的主张:用简单的高级语言编写程序，并“免费”获得高性能的GPU代码。他们通过将输入语言限制在特定领域(如Halide中的图像和数组处理)或一组固定的灵活并行模式(如Lift和RISE)来实现这一壮举。实现生成高性能代码的高级代码生成器具有挑战性，特别是随着目标硬件的不断发展。在本文中，我们讨论了如何系统地扩展RISE高级代码生成器，以支持张量核，这是最近Nvidia gpu的专用硬件功能。我们强调了RISE的设计，通过遵循系统的自底向上方法，使其易于扩展，首先，将命令式张量核心API暴露给代码生成器，然后，将抽象提升到内部低级函数表示，最后，由从高级函数程序开始的重写过程作为目标。我们的实验评估表明，支持张量核心的RISE生成的代码与手动优化的CUDA代码相比，性能具有竞争力，最高可达36%，但平均只有10%，比Nvidia高度优化的cuBLAS库慢，并且明显优于任何不利用张量核心的代码。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Systematically extending a high-level code generator with support for tensor cores

High-level code generators like Halide, Lift, and RISE make a compelling proposition: write programs in a simple high-level language and get high-performing GPU code "for free". They achieve this feat by restricting the input language to a specific domain (such as image and array processing in Halide) or to a fixed set of flexible parallel patterns (as Lift and RISE do). Implementing high-level code generators that produce high-performance code is challenging, specifically as the target hardware constantly evolves. In this paper, we discuss how we systematically extend the RISE high-level code generator with support for tensor cores, a specialized hardware feature of recent Nvidia GPUs. We highlight the design of RISE that makes it easily extensible by following a systematic bottom-up approach, that first, exposes the imperative tensor core API to the code generator, then, raises the abstractions to an internal low-level functional representation, that, finally, is targeted by a rewrite process that starts from a high-level functional program. Our experimental evaluation shows that RISE with support for tensor cores generates code of competitive performance to manually optimized CUDA code, which is only up to 36%, but on average only 10%, slower than Nvidia's highly optimized cuBLAS library, and clearly outperforms any code that does not exploit tensor cores.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 14th Workshop on General Purpose Processing Using GPU

自引率

0.00%

发文量