VkFFT and beyond - a platform for runtime GPU code generation

Dmitrii Tolmachev
{"title":"VkFFT和超越-运行时GPU代码生成的平台","authors":"Dmitrii Tolmachev","doi":"10.1145/3585341.3585357","DOIUrl":null,"url":null,"abstract":"This talk will present the VkFFT version 1.3 and the new platform for runtime GPU code generation it is based on. The main reason for this update is to make algorithms implemented in VkFFT available for many other GPU applications and standardize the way the code is generated in it. The platform presented allows fine-tuning of the algorithms for a particular GPU and API they are executed on at runtime. It aims to make it easier for competent GPU programmers to express themselves to different APIs, as the design logic of modern GPUs is fairly similar between all vendors. This is the main difference between the platform and other existing API-independent ways to write code, as they usually aim at fast prototyping and simple optimizations under the hood for beginner-level GPU programmers. The platform has a hierarchical structure design: Application -> Plan -> Code. At the application stage, the platform performs all interactions with the user and resources management. This includes configuration parsing, calls to the application initialization, update, dispatch and deletion with optional binary caching. The plan stage is the internal configuration stage that constructs the intermediate representation of the problem to be solved. This includes all algorithm decision-making, resource allocation, calls to the code generator and code compilation. The code generation stage produces a string that will hold GPU code for a particular API that can be later compiled and used. It is further divided into multiple levels: level 2 subkernels – a clear description of the problem via a sequence of calls to lower levels; level 1 subkernels – simple routines: matrix-vector multiplication, FFT, pre- and post-processing, R2C/R2R mappings; level 0 subkernels – memory management, basic math, functions inlining, API-dependent definitions. The code generator operates on special data containers, that can hold either known during the plan creation integer/float values or strings of variable names. Using a multiplication operation that performs A=B*C as an example, if all containers have known values, A can be precomputed during plan creation. If A, B and C are register names, we print to the kernel an operation of multiplication to be executed. This talk will also discuss multiple algorithms implemented with this platform. On the example of VkFFT we will demonstrate the overall platform structure and the general GPU application design guidelines, mainly related to optimization of memory layout, such as having no CPU-GPU transfers during execution except for asynchronous downloads from the GPU, minimization of GPU dedicated memory-L2-L1 communication and maximization of on-chip memory usage. To go even further, we will demonstrate how a finite difference solver can be implemented with a help of the platform using only low-level warp shuffling instructions to perform on-chip data transfers instead of using the shared memory of the streaming multiprocessor (on-chip memory accessible by all threads). This considerably reduces the number of communications between threads, which can be a performance-limiting factor for high-order schemes. 
We will demonstrate the benchmark comparison of warp communication performance of modern GPUs, including high-end HPC GPUs from Nvidia and AMD and consumer-level solutions.","PeriodicalId":360830,"journal":{"name":"Proceedings of the 2023 International Workshop on OpenCL","volume":"47 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"VkFFT and beyond - a platform for runtime GPU code generation\",\"authors\":\"Dmitrii Tolmachev\",\"doi\":\"10.1145/3585341.3585357\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This talk will present the VkFFT version 1.3 and the new platform for runtime GPU code generation it is based on. The main reason for this update is to make algorithms implemented in VkFFT available for many other GPU applications and standardize the way the code is generated in it. The platform presented allows fine-tuning of the algorithms for a particular GPU and API they are executed on at runtime. It aims to make it easier for competent GPU programmers to express themselves to different APIs, as the design logic of modern GPUs is fairly similar between all vendors. This is the main difference between the platform and other existing API-independent ways to write code, as they usually aim at fast prototyping and simple optimizations under the hood for beginner-level GPU programmers. The platform has a hierarchical structure design: Application -> Plan -> Code. At the application stage, the platform performs all interactions with the user and resources management. This includes configuration parsing, calls to the application initialization, update, dispatch and deletion with optional binary caching. The plan stage is the internal configuration stage that constructs the intermediate representation of the problem to be solved. This includes all algorithm decision-making, resource allocation, calls to the code generator and code compilation. The code generation stage produces a string that will hold GPU code for a particular API that can be later compiled and used. It is further divided into multiple levels: level 2 subkernels – a clear description of the problem via a sequence of calls to lower levels; level 1 subkernels – simple routines: matrix-vector multiplication, FFT, pre- and post-processing, R2C/R2R mappings; level 0 subkernels – memory management, basic math, functions inlining, API-dependent definitions. The code generator operates on special data containers, that can hold either known during the plan creation integer/float values or strings of variable names. Using a multiplication operation that performs A=B*C as an example, if all containers have known values, A can be precomputed during plan creation. If A, B and C are register names, we print to the kernel an operation of multiplication to be executed. This talk will also discuss multiple algorithms implemented with this platform. On the example of VkFFT we will demonstrate the overall platform structure and the general GPU application design guidelines, mainly related to optimization of memory layout, such as having no CPU-GPU transfers during execution except for asynchronous downloads from the GPU, minimization of GPU dedicated memory-L2-L1 communication and maximization of on-chip memory usage. 
To go even further, we will demonstrate how a finite difference solver can be implemented with a help of the platform using only low-level warp shuffling instructions to perform on-chip data transfers instead of using the shared memory of the streaming multiprocessor (on-chip memory accessible by all threads). This considerably reduces the number of communications between threads, which can be a performance-limiting factor for high-order schemes. We will demonstrate the benchmark comparison of warp communication performance of modern GPUs, including high-end HPC GPUs from Nvidia and AMD and consumer-level solutions.\",\"PeriodicalId\":360830,\"journal\":{\"name\":\"Proceedings of the 2023 International Workshop on OpenCL\",\"volume\":\"47 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-04-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2023 International Workshop on OpenCL\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3585341.3585357\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2023 International Workshop on OpenCL","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3585341.3585357","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

This talk will present VkFFT version 1.3 and the new platform for runtime GPU code generation it is based on. The main goal of this update is to make the algorithms implemented in VkFFT available to many other GPU applications and to standardize how its code is generated. The platform allows fine-tuning of algorithms at runtime for the particular GPU and API they are executed on. It aims to make it easier for competent GPU programmers to express themselves across different APIs, since the design logic of modern GPUs is fairly similar across vendors. This is the main difference between this platform and other existing API-independent approaches to writing code, which usually target fast prototyping and simple under-the-hood optimizations for beginner-level GPU programmers.

The platform has a hierarchical design: Application -> Plan -> Code. At the application stage, the platform performs all interaction with the user and all resource management: configuration parsing and calls to application initialization, update, dispatch and deletion, with optional binary caching. The plan stage is the internal configuration stage that constructs an intermediate representation of the problem to be solved: all algorithmic decision-making, resource allocation, calls to the code generator and code compilation. The code generation stage produces a string holding GPU code for a particular API, which can later be compiled and used. It is further divided into multiple levels: level 2 subkernels give a clear description of the problem via a sequence of calls to lower levels; level 1 subkernels are simple routines such as matrix-vector multiplication, FFT, pre- and post-processing, and R2C/R2R mappings; level 0 subkernels handle memory management, basic math, function inlining and API-dependent definitions.

The code generator operates on special data containers that hold either integer/float values known during plan creation or strings with variable names. Taking a multiplication A = B * C as an example: if all containers hold known values, A can be precomputed during plan creation; if A, B and C are register names, a multiplication operation is printed into the kernel to be executed at runtime.

This talk will also discuss multiple algorithms implemented with the platform. Using VkFFT as an example, we will demonstrate the overall platform structure and general GPU application design guidelines, mainly related to optimization of memory layout: no CPU-GPU transfers during execution except for asynchronous downloads from the GPU, minimized communication between dedicated GPU memory, L2 and L1, and maximized use of on-chip memory. Going further, we will demonstrate how a finite difference solver can be implemented with the help of the platform using only low-level warp shuffle instructions for on-chip data transfers, instead of the shared memory of the streaming multiprocessor (on-chip memory accessible by all threads). This considerably reduces the number of communications between threads, which can be a performance-limiting factor for high-order schemes. We will conclude with a benchmark comparison of warp communication performance on modern GPUs, including high-end HPC GPUs from Nvidia and AMD as well as consumer-level solutions.
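To make the container mechanism concrete, the following is a minimal C++ sketch of the idea described above: a container that holds either a value known at plan creation or the name of a run-time variable, and a multiplication step that either constant-folds or emits kernel code. All names here (Container, mul, regA, regC) are illustrative assumptions, not the platform's actual API.

#include <string>
#include <cstdio>

// Hypothetical container: holds either a value known at plan
// creation or the name of a run-time variable (register).
struct Container {
    bool known;        // true -> value is usable at plan time
    double value;      // valid when known == true
    std::string name;  // valid when known == false (must be set by caller)
};

// Hypothetical code-generator step for A = B * C.
// If both inputs are known, A is precomputed at plan creation;
// otherwise a multiplication is appended to the kernel string.
void mul(std::string& kernel, Container& A,
         const Container& B, const Container& C) {
    if (B.known && C.known) {
        A.known = true;
        A.value = B.value * C.value;   // constant folding at plan time
    } else {
        A.known = false;
        kernel += A.name + " = " +
                  (B.known ? std::to_string(B.value) : B.name) + " * " +
                  (C.known ? std::to_string(C.value) : C.name) + ";\n";
    }
}

int main() {
    std::string kernel;
    Container A{false, 0.0, "regA"}, B{true, 2.0, ""}, C{false, 0.0, "regC"};
    mul(kernel, A, B, C);               // emits: regA = 2.000000 * regC;
    std::printf("%s", kernel.c_str());
}

The same dispatch-on-knownness pattern extends to any generator operation: whatever can be resolved during plan creation is folded away, and only genuinely run-time work ends up in the generated kernel string.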
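The shuffle-based approach to on-chip data exchange can likewise be sketched. Below is a minimal CUDA illustration of the general technique the talk refers to: a second-order central difference in which each thread obtains its neighbors' values via warp shuffle instructions rather than staging them in shared memory. The kernel, stencil order and boundary handling are assumptions for illustration, not the talk's actual solver.

// Minimal sketch: central difference du[i] = (u[i+1] - u[i-1]) / (2*dx)
// computed with warp shuffles instead of shared memory.
// Assumes blockDim.x is a multiple of 32 so every warp is full.
__global__ void diff_warp(const float* u, float* du, float inv2dx, int n) {
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;            // lane id within the warp
    float c  = (i < n) ? u[i] : 0.0f;       // all lanes join the shuffles

    // Pull neighbor values from adjacent lanes: no shared-memory round trip.
    float left  = __shfl_up_sync(0xffffffff, c, 1);
    float right = __shfl_down_sync(0xffffffff, c, 1);

    // Lanes 0 and 31 have no in-warp neighbor; fall back to global loads.
    if (lane == 0  && i > 0)     left  = u[i - 1];
    if (lane == 31 && i + 1 < n) right = u[i + 1];

    if (i > 0 && i + 1 < n)
        du[i] = (right - left) * inv2dx;
}

// Example launch: diff_warp<<<(n + 255) / 256, 256>>>(u, du, 0.5f / dx, n);

For higher-order stencils, each additional neighbor costs one more shuffle step, which is exactly the inter-thread communication the talk identifies as a potential performance limiter.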