{"title":"OpenCL GPU内核的启动时优化","authors":"Andrew S. D. Lee, T. Abdelrahman","doi":"10.1145/3038228.3038236","DOIUrl":null,"url":null,"abstract":"OpenCL compiles a GPU kernel first and then launches it for execution, providing the kernel at this launch with its arguments and its launch geometry. Although some of the kernel inputs and the launch geometry remain constant across all threads during execution, the compiler is unable to treat them as such, which limits its ability to apply several optimizations, including constant propagation, constant folding, strength reduction and loop unrolling. In this paper we describe a novel approach to address this problem. At compile-time, the kernel input arguments and variables holding constant values of the launch geometry are identified. The kernel's PTX code is analyzed and is marked with annotations that reflect the actions an optimizer would have performed had the values of the aforementioned variables been compile-time-known constants. At kernel launch time the annotations, combined with the now known values of these variables, are used to optimize the code, thereby improving kernel performance. We compare the execution time of 12 GPU kernels compiled with a standard LLVM-based compilation flow to their execution time when compiled with the same flow, modified to implement our approach. The results show that annotation processing is fast and that kernel performance is improved by a factor of up to 2.13X and on average by 1.17X across the benchmarks. When taking into account the entire compilation flow, the resulting benefit depends on how often a kernel is launched. When the kernel is launched many times with the same arguments and the same geometry, kernel execution time, including the compilation flow, benefits by similar factors. However, when the kernel is launched with different arguments and/or geometries, performance suffers because of the overhead of repeated PTX-to-Cubin compilation.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"184 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Launch-Time Optimization of OpenCL GPU Kernels\",\"authors\":\"Andrew S. D. Lee, T. Abdelrahman\",\"doi\":\"10.1145/3038228.3038236\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"OpenCL compiles a GPU kernel first and then launches it for execution, providing the kernel at this launch with its arguments and its launch geometry. Although some of the kernel inputs and the launch geometry remain constant across all threads during execution, the compiler is unable to treat them as such, which limits its ability to apply several optimizations, including constant propagation, constant folding, strength reduction and loop unrolling. In this paper we describe a novel approach to address this problem. At compile-time, the kernel input arguments and variables holding constant values of the launch geometry are identified. The kernel's PTX code is analyzed and is marked with annotations that reflect the actions an optimizer would have performed had the values of the aforementioned variables been compile-time-known constants. At kernel launch time the annotations, combined with the now known values of these variables, are used to optimize the code, thereby improving kernel performance. 
We compare the execution time of 12 GPU kernels compiled with a standard LLVM-based compilation flow to their execution time when compiled with the same flow, modified to implement our approach. The results show that annotation processing is fast and that kernel performance is improved by a factor of up to 2.13X and on average by 1.17X across the benchmarks. When taking into account the entire compilation flow, the resulting benefit depends on how often a kernel is launched. When the kernel is launched many times with the same arguments and the same geometry, kernel execution time, including the compilation flow, benefits by similar factors. However, when the kernel is launched with different arguments and/or geometries, performance suffers because of the overhead of repeated PTX-to-Cubin compilation.\",\"PeriodicalId\":108772,\"journal\":{\"name\":\"Proceedings of the General Purpose GPUs\",\"volume\":\"184 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-02-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the General Purpose GPUs\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3038228.3038236\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the General Purpose GPUs","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3038228.3038236","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract: OpenCL compiles a GPU kernel first and then launches it for execution, providing the kernel at launch with its arguments and its launch geometry. Although some of the kernel inputs and the launch geometry remain constant across all threads during execution, the compiler is unable to treat them as constants, which limits its ability to apply several optimizations, including constant propagation, constant folding, strength reduction, and loop unrolling. In this paper we describe a novel approach to address this problem. At compile time, the kernel input arguments and the variables holding the constant values of the launch geometry are identified. The kernel's PTX code is analyzed and marked with annotations that reflect the actions an optimizer would have performed had the values of these variables been constants known at compile time. At kernel launch time, the annotations, combined with the now-known values of these variables, are used to optimize the code, thereby improving kernel performance.

We compare the execution time of 12 GPU kernels compiled with a standard LLVM-based compilation flow to their execution time when compiled with the same flow, modified to implement our approach. The results show that annotation processing is fast and that kernel performance is improved by a factor of up to 2.13X, and by 1.17X on average, across the benchmarks. When the entire compilation flow is taken into account, the resulting benefit depends on how often a kernel is launched. When the kernel is launched many times with the same arguments and the same geometry, kernel execution time, including the compilation flow, benefits by similar factors. However, when the kernel is launched with different arguments and/or geometries, performance suffers because of the overhead of repeated PTX-to-Cubin compilation.
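To make the idea concrete, the sketch below illustrates, in ordinary OpenCL C rather than at the PTX level the paper actually operates on, the kind of specialization that becomes possible once a kernel argument and the launch geometry are known at launch time. The kernel names, the assumed launch-time values (stride = 4, global size = 1024, a = 2.0f), and the hand-specialized second kernel are illustrative assumptions, not material taken from the paper.

// Hypothetical OpenCL kernel: 'a', 'stride', and the launch geometry are
// unknown at compile time, so the compiler must emit generic arithmetic
// and cannot resolve the loop trip count or unroll the loop.
__kernel void saxpy_strided(__global float *y,
                            __global const float *x,
                            float a,            // runtime argument
                            int stride)         // runtime argument
{
    int gid = get_global_id(0);
    int gsz = get_global_size(0);               // launch geometry, runtime value

    // Generic form: stride and gsz are opaque to the compiler.
    for (int i = gid; i < gsz * stride; i += gsz)
        y[i] = a * x[i] + y[i];
}

// Specialized form an optimizer could produce once the launch-time values
// are known (assumed here: a == 2.0f, stride == 4, global size == 1024).
// Constant propagation removes the argument uses, the trip count becomes a
// compile-time constant (4), and the loop is fully unrolled.
__kernel void saxpy_strided_specialized(__global float *y,
                                        __global const float *x)
{
    int gid = get_global_id(0);

    // gsz * stride = 4096 and the step 1024 are now known constants.
    y[gid]        = 2.0f * x[gid]        + y[gid];
    y[gid + 1024] = 2.0f * x[gid + 1024] + y[gid + 1024];
    y[gid + 2048] = 2.0f * x[gid + 2048] + y[gid + 2048];
    y[gid + 3072] = 2.0f * x[gid + 3072] + y[gid + 3072];
}

The specialized version is only valid for that one combination of arguments and geometry, which is why the paper's launch-time scheme pays off when a kernel is relaunched with the same values but incurs repeated PTX-to-Cubin compilation when the values change.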