{"title":"OpenCL GPU内核的启动时优化","authors":"Andrew S. D. Lee, T. Abdelrahman","doi":"10.1145/3038228.3038236","DOIUrl":null,"url":null,"abstract":"OpenCL compiles a GPU kernel first and then launches it for execution, providing the kernel at this launch with its arguments and its launch geometry. Although some of the kernel inputs and the launch geometry remain constant across all threads during execution, the compiler is unable to treat them as such, which limits its ability to apply several optimizations, including constant propagation, constant folding, strength reduction and loop unrolling. In this paper we describe a novel approach to address this problem. At compile-time, the kernel input arguments and variables holding constant values of the launch geometry are identified. The kernel's PTX code is analyzed and is marked with annotations that reflect the actions an optimizer would have performed had the values of the aforementioned variables been compile-time-known constants. At kernel launch time the annotations, combined with the now known values of these variables, are used to optimize the code, thereby improving kernel performance. We compare the execution time of 12 GPU kernels compiled with a standard LLVM-based compilation flow to their execution time when compiled with the same flow, modified to implement our approach. The results show that annotation processing is fast and that kernel performance is improved by a factor of up to 2.13X and on average by 1.17X across the benchmarks. When taking into account the entire compilation flow, the resulting benefit depends on how often a kernel is launched. When the kernel is launched many times with the same arguments and the same geometry, kernel execution time, including the compilation flow, benefits by similar factors. However, when the kernel is launched with different arguments and/or geometries, performance suffers because of the overhead of repeated PTX-to-Cubin compilation.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"184 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Launch-Time Optimization of OpenCL GPU Kernels\",\"authors\":\"Andrew S. D. Lee, T. Abdelrahman\",\"doi\":\"10.1145/3038228.3038236\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"OpenCL compiles a GPU kernel first and then launches it for execution, providing the kernel at this launch with its arguments and its launch geometry. Although some of the kernel inputs and the launch geometry remain constant across all threads during execution, the compiler is unable to treat them as such, which limits its ability to apply several optimizations, including constant propagation, constant folding, strength reduction and loop unrolling. In this paper we describe a novel approach to address this problem. At compile-time, the kernel input arguments and variables holding constant values of the launch geometry are identified. The kernel's PTX code is analyzed and is marked with annotations that reflect the actions an optimizer would have performed had the values of the aforementioned variables been compile-time-known constants. At kernel launch time the annotations, combined with the now known values of these variables, are used to optimize the code, thereby improving kernel performance. 
We compare the execution time of 12 GPU kernels compiled with a standard LLVM-based compilation flow to their execution time when compiled with the same flow, modified to implement our approach. The results show that annotation processing is fast and that kernel performance is improved by a factor of up to 2.13X and on average by 1.17X across the benchmarks. When taking into account the entire compilation flow, the resulting benefit depends on how often a kernel is launched. When the kernel is launched many times with the same arguments and the same geometry, kernel execution time, including the compilation flow, benefits by similar factors. However, when the kernel is launched with different arguments and/or geometries, performance suffers because of the overhead of repeated PTX-to-Cubin compilation.\",\"PeriodicalId\":108772,\"journal\":{\"name\":\"Proceedings of the General Purpose GPUs\",\"volume\":\"184 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-02-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the General Purpose GPUs\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3038228.3038236\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the General Purpose GPUs","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3038228.3038236","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract: OpenCL compiles a GPU kernel first and then launches it for execution, providing the kernel at launch with its arguments and its launch geometry. Although some of the kernel inputs and the launch geometry remain constant across all threads during execution, the compiler is unable to treat them as constants, which limits its ability to apply several optimizations, including constant propagation, constant folding, strength reduction, and loop unrolling. In this paper we describe a novel approach to address this problem. At compile time, the kernel input arguments and the variables holding the constant values of the launch geometry are identified. The kernel's PTX code is analyzed and marked with annotations that reflect the actions an optimizer would have performed had the values of these variables been constants known at compile time. At kernel launch time, the annotations, combined with the now-known values of these variables, are used to optimize the code, thereby improving kernel performance.

We compare the execution time of 12 GPU kernels compiled with a standard LLVM-based compilation flow to their execution time when compiled with the same flow, modified to implement our approach. The results show that annotation processing is fast and that kernel performance is improved by a factor of up to 2.13X, and by 1.17X on average, across the benchmarks. When the entire compilation flow is taken into account, the resulting benefit depends on how often a kernel is launched. When the kernel is launched many times with the same arguments and the same geometry, kernel execution time, including the compilation flow, benefits by similar factors. However, when the kernel is launched with different arguments and/or geometries, performance suffers because of the overhead of repeated PTX-to-Cubin compilation.
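To make the idea concrete, the sketch below illustrates, in ordinary OpenCL C rather than at the PTX level the paper actually operates on, the kind of specialization that becomes possible once a kernel argument and the launch geometry are known at launch time. The kernel names, the assumed launch-time values (stride = 4, global size = 1024, a = 2.0f), and the hand-specialized second kernel are illustrative assumptions, not material taken from the paper.

// Hypothetical OpenCL kernel: 'a', 'stride', and the launch geometry are
// unknown at compile time, so the compiler must emit generic arithmetic
// and cannot resolve the loop trip count or unroll the loop.
__kernel void saxpy_strided(__global float *y,
                            __global const float *x,
                            float a,            // runtime argument
                            int stride)         // runtime argument
{
    int gid = get_global_id(0);
    int gsz = get_global_size(0);               // launch geometry, runtime value

    // Generic form: stride and gsz are opaque to the compiler.
    for (int i = gid; i < gsz * stride; i += gsz)
        y[i] = a * x[i] + y[i];
}

// Specialized form an optimizer could produce once the launch-time values
// are known (assumed here: a == 2.0f, stride == 4, global size == 1024).
// Constant propagation removes the argument uses, the trip count becomes a
// compile-time constant (4), and the loop is fully unrolled.
__kernel void saxpy_strided_specialized(__global float *y,
                                        __global const float *x)
{
    int gid = get_global_id(0);

    // gsz * stride = 4096 and the step 1024 are now known constants.
    y[gid]        = 2.0f * x[gid]        + y[gid];
    y[gid + 1024] = 2.0f * x[gid + 1024] + y[gid + 1024];
    y[gid + 2048] = 2.0f * x[gid + 2048] + y[gid + 2048];
    y[gid + 3072] = 2.0f * x[gid + 3072] + y[gid + 3072];
}

The specialized version is only valid for that one combination of arguments and geometry, which is why the paper's launch-time scheme pays off when a kernel is relaunched with the same values but incurs repeated PTX-to-Cubin compilation when the values change.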