Helium: a transparent inter-kernel optimizer for OpenCL

Proceedings of the 8th Workshop on General Purpose Processing using GPUs. Pub Date: 2015-02-07. DOI: 10.1145/2716282.2716284

Thibaut Lutz, Christian Fensch, M. Cole


Citations: 13

Abstract

State-of-the-art automatic optimization of OpenCL applications focuses on improving the performance of individual compute kernels. Programmers address opportunities for inter-kernel optimization in specific applications by ad hoc hand tuning: manually fusing kernels together. However, the complexity of interactions between host and kernel code makes this approach weak or even unviable for applications involving more than a small number of kernel invocations or highly dynamic control flow, leaving substantial potential opportunities unexplored. It also leads to an overly complex, hard-to-maintain code base. We present Helium, a transparent OpenCL overlay which discovers, manipulates and exploits opportunities for inter- and intra-kernel optimization. Helium is implemented as a preloaded library and uses a delay-optimize-replay mechanism in which kernel calls are intercepted, collectively optimized, and then executed according to an improved execution plan. This allows us to benefit from composite optimizations on large, dynamically complex applications, with no impact on the code base. Our results show that Helium obtains performance at least equal to, and frequently better than, that of carefully hand-tuned code. Helium outperforms hand-optimized code where the exact dynamic composition of compute kernels cannot be known statically. In these cases, we demonstrate speedups of up to 3x over unoptimized code and an average speedup of 1.4x over hand-optimized code.
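The delay-optimize-replay idea described in the abstract can be illustrated with a toy model. The sketch below is not Helium's implementation and uses no OpenCL: `DeferredQueue`, `KernelCall`, and the fusion rule are hypothetical names and simplifications chosen for illustration. It assumes element-wise "kernels" over named buffers, records each call instead of running it, fuses a call into its predecessor only when it updates in place the buffer the predecessor just wrote, and replays the shortened plan when the host reads a result.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Fn = Callable[[float], float]


def _compose(first: Fn, second: Fn) -> Fn:
    """Return a function that applies `first`, then `second`."""
    return lambda v: second(first(v))


@dataclass
class KernelCall:
    """One recorded element-wise launch: dst[i] = fn(src[i])."""
    fn: Fn
    src: str
    dst: str


class DeferredQueue:
    """Toy interception layer (hypothetical; not Helium's API)."""

    def __init__(self, buffers: Dict[str, List[float]]):
        self.buffers = buffers
        self.pending: List[KernelCall] = []
        self.launches = 0  # passes actually executed

    def enqueue(self, fn: Fn, src: str, dst: str) -> None:
        # Delay: record the call instead of executing it.
        self.pending.append(KernelCall(fn, src, dst))

    def _optimize(self) -> List[KernelCall]:
        # Optimize: fuse a call into its predecessor when it reads and
        # overwrites the buffer the predecessor just wrote. The
        # intermediate value is never observable, so fusion is safe.
        plan: List[KernelCall] = []
        for call in self.pending:
            prev = plan[-1] if plan else None
            if prev and call.src == prev.dst and call.dst == prev.dst:
                fused = _compose(prev.fn, call.fn)
                plan[-1] = KernelCall(fused, prev.src, prev.dst)
            else:
                plan.append(call)
        return plan

    def flush(self) -> None:
        # Replay: execute the improved plan.
        plan, self.pending = self._optimize(), []
        for call in plan:
            src = self.buffers[call.src]
            self.buffers[call.dst] = [call.fn(v) for v in src]
            self.launches += 1

    def read(self, name: str) -> List[float]:
        # A host read is a synchronization point, so it forces a replay.
        self.flush()
        return self.buffers[name]


q = DeferredQueue({"x": [1.0, 2.0, 3.0]})
q.enqueue(lambda v: v * 2.0, "x", "y")
q.enqueue(lambda v: v + 1.0, "y", "y")
q.enqueue(lambda v: v * v, "y", "y")
result = q.read("y")
print(result, q.launches)
```

Here three recorded calls collapse into a single pass over the data. A real system would also need dependency analysis across buffers, handling of non-element-wise kernels, and code generation for the fused kernel, none of which this sketch attempts.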

