Systematically extending a high-level code generator with support for tensor cores

Lukas Siefke, Bastian Köpcke, S. Gorlatch, Michel Steuwer
Proceedings of the 14th Workshop on General Purpose Processing Using GPU, published 2022-04-03
DOI: 10.1145/3530390.3532733 (https://doi.org/10.1145/3530390.3532733)
Citations: 1

Abstract

High-level code generators like Halide, Lift, and RISE make a compelling proposition: write programs in a simple high-level language and get high-performing GPU code "for free". They achieve this feat by restricting the input language to a specific domain (such as image and array processing in Halide) or to a fixed set of flexible parallel patterns (as Lift and RISE do). Implementing high-level code generators that produce high-performance code is challenging, especially because the target hardware constantly evolves. In this paper, we discuss how we systematically extend the RISE high-level code generator with support for tensor cores, a specialized hardware feature of recent Nvidia GPUs. We highlight the design of RISE, which makes it easily extensible through a systematic bottom-up approach: first, we expose the imperative tensor core API to the code generator; then, we raise the abstractions to an internal low-level functional representation; finally, this representation is targeted by a rewrite process that starts from a high-level functional program. Our experimental evaluation shows that RISE with support for tensor cores generates code whose performance is competitive with manually optimized CUDA code: it is at most 36% slower, and on average only 10% slower, than Nvidia's highly optimized cuBLAS library, and it clearly outperforms any code that does not exploit tensor cores.
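
The bottom-up layering described in the abstract can be pictured with a toy model in plain Python. This is only a sketch under stated assumptions, not RISE's actual primitives or the tensor core API: every name here (`load_fragment`, `mma`, `store_fragment`, `output_tile`, `matmul_spec`, `matmul_tiled`) is hypothetical, and the tile size of 2 is chosen only to keep the example small. Layer 1 stands in for imperative fragment operations, layer 2 wraps them in a functional fold, and layer 3 contrasts a high-level specification with a tiled version that a rewrite process might arrive at.

```python
from functools import reduce

TILE = 2  # assumption: tiny tile for the demo; real tensor-core fragments are larger


# Layer 1 (hypothetical): imperative, fragment-level operations.
def load_fragment(matrix, row, col):
    """Copy the TILE x TILE block whose top-left corner is (row, col)."""
    return [r[col:col + TILE] for r in matrix[row:row + TILE]]


def mma(a, b, acc):
    """Return acc + a @ b for TILE x TILE fragments."""
    return [[acc[i][j] + sum(a[i][k] * b[k][j] for k in range(TILE))
             for j in range(TILE)]
            for i in range(TILE)]


def store_fragment(matrix, row, col, frag):
    """Write a TILE x TILE fragment back into matrix at (row, col)."""
    for i in range(TILE):
        matrix[row + i][col:col + TILE] = frag[i]


# Layer 2 (hypothetical): a functional view, one output tile as a fold over k.
def output_tile(a, b, row, col, k_tiles):
    zero = [[0] * TILE for _ in range(TILE)]
    return reduce(
        lambda acc, k: mma(load_fragment(a, row, k * TILE),
                           load_fragment(b, k * TILE, col),
                           acc),
        range(k_tiles),
        zero,
    )


# Layer 3: the high-level specification and a tiled form it could be rewritten to.
def matmul_spec(a, b):
    """Plain matrix multiplication, with no notion of tiles."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def matmul_tiled(a, b):
    """Same result as matmul_spec, expressed only through fragment operations."""
    m, k, n = len(a), len(b), len(b[0])
    assert m % TILE == 0 and k % TILE == 0 and n % TILE == 0
    c = [[0] * n for _ in range(m)]
    for row in range(0, m, TILE):
        for col in range(0, n, TILE):
            store_fragment(c, row, col, output_tile(a, b, row, col, k // TILE))
    return c
```

The point of the sketch is that `matmul_tiled` and `matmul_spec` compute the same matrix, so a rewrite from the high-level form to the fragment-based form preserves meaning while exposing the operations that specialized hardware can execute.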