VkFFT and beyond - a platform for runtime GPU code generation
Dmitrii Tolmachev
Proceedings of the 2023 International Workshop on OpenCL, 2023-04-18. DOI: 10.1145/3585341.3585357
This talk will present VkFFT version 1.3 and the new platform for runtime GPU code generation it is based on. The main goal of this update is to make the algorithms implemented in VkFFT available to many other GPU applications and to standardize the way its code is generated. The platform allows fine-tuning of the algorithms at runtime for the particular GPU and API they are executed on. It aims to make it easier for experienced GPU programmers to target different APIs, as the design logic of modern GPUs is fairly similar across all vendors. This is the main difference between this platform and other existing API-independent ways to write code, which usually aim at fast prototyping and simple under-the-hood optimizations for beginner-level GPU programmers.

The platform has a hierarchical design: Application -> Plan -> Code. At the application stage, the platform performs all interactions with the user and manages resources: configuration parsing and calls to application initialization, update, dispatch and deletion, with optional binary caching. The plan stage is the internal configuration stage that constructs an intermediate representation of the problem to be solved; it covers all algorithmic decision-making, resource allocation, calls to the code generator and code compilation. The code generation stage produces a string holding GPU code for a particular API, which can later be compiled and used. It is further divided into multiple levels: level 2 subkernels – a clear description of the problem via a sequence of calls to lower levels; level 1 subkernels – simple routines such as matrix-vector multiplication, FFT, pre- and post-processing, and R2C/R2R mappings; level 0 subkernels – memory management, basic math, function inlining and API-dependent definitions.
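The three-level code generation structure described above can be sketched roughly as follows. This is a minimal illustrative model, not VkFFT's actual API: all function names and the emitted kernel text are hypothetical, and the level 1 routine shown is a simple radix-2 butterfly emitted as a string.

```python
# Hypothetical sketch of the level 2 -> level 1 -> level 0 hierarchy: each
# level builds up the kernel string by calling the level below it.

def level0_prologue(api: str) -> str:
    """Level 0: API-dependent definitions (here, the kernel signature)."""
    if api == "opencl":
        return "__kernel void fft_kernel(__global float2* data) {\n"
    return 'extern "C" __global__ void fft_kernel(float2* data) {\n'

def level1_butterfly(a: str, b: str) -> str:
    """Level 1: a simple routine (radix-2 butterfly) emitted as text."""
    return (f"  float2 t = {a};\n"
            f"  {a} = t + {b};\n"
            f"  {b} = t - {b};\n")

def level2_kernel(api: str) -> str:
    """Level 2: describes the problem as a sequence of lower-level calls."""
    return level0_prologue(api) + level1_butterfly("data[0]", "data[1]") + "}\n"
```

The resulting string would then be handed to the API's runtime compiler (e.g. an OpenCL or CUDA driver) during the plan stage.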
The code generator operates on special data containers that can hold either integer/float values known during plan creation or strings holding variable names. Taking a multiplication operation A=B*C as an example: if all containers hold known values, A can be precomputed during plan creation; if A, B and C are register names, a multiplication operation to be executed is printed to the kernel instead.

This talk will also discuss multiple algorithms implemented with this platform. Using VkFFT as an example, we will demonstrate the overall platform structure and general GPU application design guidelines, mainly related to optimizing memory layout: no CPU-GPU transfers during execution except for asynchronous downloads from the GPU, minimal communication between dedicated GPU memory and the L2 and L1 caches, and maximal use of on-chip memory. Going further, we will demonstrate how a finite difference solver can be implemented with the help of the platform using only low-level warp shuffle instructions to perform on-chip data transfers, instead of using the shared memory of the streaming multiprocessor (on-chip memory accessible by all threads). This considerably reduces the number of communications between threads, which can be a performance-limiting factor for high-order schemes. We will conclude with a benchmark comparison of warp communication performance across modern GPUs, including high-end HPC GPUs from Nvidia and AMD as well as consumer-level solutions.
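The container mechanism described earlier (precompute when values are known, emit code when they are register names) can be sketched as follows. This is a hedged illustration of the idea, not VkFFT's actual data structures; the `Container` class and `emit_mul` helper are hypothetical names.

```python
# Sketch of a code-generator container: it holds either a value known at
# plan creation time or the name of a register in the generated kernel.

class Container:
    def __init__(self, value=None, name=None):
        self.value = value   # numeric value known at plan creation, or None
        self.name = name     # register name in the generated kernel, or None

    def known(self):
        return self.value is not None

def emit_mul(a, b, c, kernel_lines):
    """A = B * C: fold constants at plan creation, else print to the kernel."""
    if b.known() and c.known():
        a.value = b.value * c.value      # precomputed during plan creation
        a.name = None
    else:
        b_str = repr(b.value) if b.known() else b.name
        c_str = repr(c.value) if c.known() else c.name
        kernel_lines.append(f"{a.name} = {b_str} * {c_str};")
```

With both operands known, no code is emitted at all; with at least one register operand, the multiplication is appended to the kernel string for runtime execution.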
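The warp-shuffle idea described above can be modeled on the CPU to show why it avoids shared memory: each lane of a warp keeps one grid value in a register, and a shuffle lets a lane read a neighboring lane's register directly. The following is an illustrative Python model of that exchange pattern (not GPU code; the clamping behavior at warp boundaries is a simplifying assumption, and real kernels would handle boundaries explicitly).

```python
# Python model of warp shuffles used for a finite-difference stencil:
# the list plays the role of a warp, one element per lane.

def shfl_down(values, delta):
    """Lane i reads the value of lane i + delta (clamped at the warp edge)."""
    n = len(values)
    return [values[i + delta] if i + delta < n else values[i] for i in range(n)]

def shfl_up(values, delta):
    """Lane i reads the value of lane i - delta (clamped at the warp edge)."""
    return [values[i - delta] if i - delta >= 0 else values[i]
            for i in range(len(values))]

def central_difference(values, h):
    """Second-order central difference using only shuffle-style exchanges."""
    right = shfl_down(values, 1)
    left = shfl_up(values, 1)
    return [(r - l) / (2.0 * h) for r, l in zip(right, left)]
```

On a GPU, each exchange would be a single register-to-register shuffle instruction within the warp, with no round trip through shared memory; for high-order stencils, which need several neighbors per point, this is where the communication savings come from.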