One Pass to Bind Them: The First Single-Pass SYCL Compiler with Unified Code Representation Across Backends

Proceedings of the 2023 International Workshop on OpenCL. Pub Date: 2023-04-18. DOI: 10.1145/3585341.3585351

Aksel Alpay, Vincent Heuveline


Citations: 3

Abstract

Current SYCL implementations rely on multiple compiler invocations to generate code for host and device, and typically even employ one compiler invocation per required backend code format such as SPIR-V, PTX or amdgcn. This makes generating “universal” binaries that can run on all devices supported by a SYCL implementation very time-consuming, or outright impractical. The ability to generate such universal binaries is important, however, for example when a software vendor wishes to distribute binaries to users who rely on unknown hardware configurations.

To address this issue, we present the very first SYCL implementation with a single-source, single compiler pass (SSCP) design and a unified code representation across backends. This allows a single compiler invocation to generate a binary that can execute kernels on all supported devices, dramatically reducing both compile times and the user effort required to generate such universal binaries. Our work is publicly available as part of the hipSYCL implementation of SYCL, and supports Intel GPUs through SPIR-V, NVIDIA GPUs through CUDA PTX and AMD GPUs through ROCm amdgcn code. Our new compiler operates in two phases: at compile time, during the regular host compilation pass, it extracts the LLVM IR of kernels. This IR is then stored in a backend-independent fashion in the host binary. At runtime, the embedded LLVM IR is lowered to the format required by backend drivers (e.g. PTX, SPIR-V, amdgcn). This approach enables portability of a single code representation even if backends do not support a common code format, while still allowing interoperability with vendor-specific optimized libraries.

We find that our new compiler can generate highly portable binaries that run on any NVIDIA, Intel or AMD ROCm GPU with only 20% additional compilation time compared to a regular clang host compilation. On our test system, this is roughly 2.2× faster than compiling with the existing hipSYCL compiler for just three AMD GPUs. We also show that the cost of the additional runtime compilation steps can be expected to be approximately comparable to the cost of runtime compilation that backend drivers already perform today, e.g. to lower SPIR-V to machine code. Lastly, we present early performance results on four different GPUs from three vendors. We find that performance is usually within 10% of current multipass SYCL compiler techniques, with the maximum deviations ranging from a performance regression of 13% to a speedup of 27%. This implies that, compared to current SYCL compilation techniques, our new compiler achieves similar performance while substantially decreasing compile times and increasing the portability of generated binaries.
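The two-phase design described in the abstract can be illustrated with a small toy model in plain C++. This is only a sketch under stated assumptions, not the hipSYCL implementation: the names `Backend`, `EmbeddedKernel`, `RuntimeLowering` and `make_demo_lowering` are hypothetical, a string stands in for the embedded LLVM IR, the "lowering" functions merely tag that string, and the per-backend cache is an assumption added for illustration rather than something the abstract states.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Toy model of the two-phase SSCP flow. All names are hypothetical and
// are not part of hipSYCL's real interface.

enum class Backend { SpirV, Ptx, Amdgcn };

// Phase 1 (compile time): the host binary carries one backend-independent
// IR blob per kernel. A string stands in for the embedded LLVM IR here.
struct EmbeddedKernel {
    std::string name;
    std::string ir;
};

// Phase 2 (runtime): lower the embedded IR to the format a backend driver
// accepts. Caching per (kernel, backend) pair is an illustrative assumption.
class RuntimeLowering {
public:
    using LowerFn = std::function<std::string(const std::string&)>;

    void register_backend(Backend b, LowerFn fn) { lowerers_[b] = std::move(fn); }

    const std::string& get(const EmbeddedKernel& k, Backend b) {
        assert(!k.ir.empty() && "kernel must carry embedded IR");
        const auto key = std::make_pair(k.name, b);
        auto cached = cache_.find(key);
        if (cached != cache_.end()) return cached->second;

        auto lowerer = lowerers_.find(b);
        if (lowerer == lowerers_.end())
            throw std::runtime_error("no lowering registered for this backend");

        ++lowerings_performed_;
        return cache_.emplace(key, lowerer->second(k.ir)).first->second;
    }

    std::size_t lowerings_performed() const { return lowerings_performed_; }

private:
    std::map<Backend, LowerFn> lowerers_;
    std::map<std::pair<std::string, Backend>, std::string> cache_;
    std::size_t lowerings_performed_ = 0;
};

// Builds a RuntimeLowering with placeholder lowerers that only tag the IR.
// A real implementation would run LLVM code generation at this point.
inline RuntimeLowering make_demo_lowering() {
    RuntimeLowering r;
    r.register_backend(Backend::SpirV, [](const std::string& ir) { return "spirv:" + ir; });
    r.register_backend(Backend::Ptx, [](const std::string& ir) { return "ptx:" + ir; });
    r.register_backend(Backend::Amdgcn, [](const std::string& ir) { return "amdgcn:" + ir; });
    return r;
}
```

In this sketch, a single `EmbeddedKernel` can be handed to `get` with any registered `Backend`, which mirrors the idea that one stored representation serves every device. Only the backends that are actually requested at runtime incur a lowering step.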
