Technical Talk: A SYCL Extension for User-Driven Online Kernel Fusion

Victor Perez, Lukas Sommer, Victor Lomüler, Kumudha Narasimhan, M. Goli
DOI: 10.1145/3585341.3585377 · Proceedings of the 2023 International Workshop on OpenCL · Published 2023-04-18
Citations: 0

Abstract

Heterogeneous programming models such as SYCL allow developers to integrate the variety of accelerators found in today's heterogeneous systems into an application with ease. However, while offloading specific tasks to specialized accelerators can deliver significant performance improvements for many applications, short-running device kernels remain a challenge for most heterogeneous programming models. Each invocation of a device kernel incurs some overhead, caused by the necessary data transfers, kernel launch, and synchronization between host and device. In particular, for a sequence of short-running kernels, this can lead to an unfavourable ratio of overhead to actual computation, resulting in performance degradation.

One potential solution is to merge multiple small, memory-bound, short-running kernels into a single larger kernel. This makes better use of the device's resources and amortizes the kernel launch overhead. Yet manually creating fused kernels is an error-prone, challenging task for developers, and the resulting kernels are less reusable and maintainable.

The extension to the SYCL API presented in this talk aims to automate the creation of fused kernels. It provides a mechanism for users or software frameworks using SYCL to instruct the runtime to automatically fuse multiple device kernels at runtime, without the need for a manual implementation of the fused kernel. Users or software frameworks can apply their application and domain knowledge, as well as runtime context information, to determine when fusion of kernels is legal and profitable, while the actual process of creating the fused kernel is automated by the SYCL runtime.

Reducing kernel launch overhead is, however, not the only way kernel fusion can improve application performance. The LLVM-based JIT compiler integrated into the SYCL runtime for the automatic creation of fused kernels can perform further optimizations.
One such optimization is the internalization of dataflow. Intermediate results that originally had to be communicated via global memory between the different kernels become internal dataflow of the fused kernel. Replacing slow global-memory accesses for this internalized dataflow with faster accesses to local memory or even registers can yield significant performance improvements for many applications.

The extension presented in this talk is currently an experimental vendor extension targeting SYCL 2020. The initial proof-of-concept implementation was based on Codeplay's ComputeCpp SYCL implementation and has also been contributed and open-sourced as part of the DPC++ SYCL implementation.

To demonstrate the performance improvements unlocked by the extension, two different types of workloads were evaluated on an Intel CPU and integrated Intel GPUs. For a set of sixteen typical operator sequences from neural networks with various input sizes, kernel fusion achieves speedups between 0.9x and 2.26x on GPU (geometric mean 1.35x) and between 1.02x and 3.2x on CPU (geometric mean 1.78x). For complete neural networks, this translates to speedups of 1.19x (ResNet-50) and 1.68x (VGG-16) on CPU, and 1.15x (ResNet-50) and 1.02x (VGG-16) on GPU. For the six benchmarks 3mm, bicg, correlation, covariance, fdtd2d, and gramschmidt from the SYCL-Bench benchmark suite with different input sizes, fusion achieves speedups between 0.98x and 4.91x on GPU (geometric mean 1.34x) and between 0.82x and 3.28x on CPU (geometric mean 1.06x).

In summary, this talk presents a SYCL extension that automates the creation of fused kernels on user request and shows the potential performance benefits of such an extension on different workloads.