One Pass to Bind Them: The First Single-Pass SYCL Compiler with Unified Code Representation Across Backends
Aksel Alpay, Vincent Heuveline
Proceedings of the 2023 International Workshop on OpenCL, published 2023-04-18
DOI: 10.1145/3585341.3585351 (https://doi.org/10.1145/3585341.3585351)
Citations: 3
Abstract
Current SYCL implementations rely on multiple compiler invocations to generate code for host and device, and typically need a separate compiler invocation for each required backend code format, such as SPIR-V, PTX or amdgcn. This makes generating “universal” binaries that can run on all devices supported by a SYCL implementation very time-consuming, or outright impractical. The ability to generate such universal binaries is nevertheless important, for example when a software vendor wishes to distribute binaries to users who rely on unknown hardware configurations. To address this issue, we present the very first SYCL implementation with a single-source, single compiler pass (SSCP) design and a unified code representation across backends. This allows a single compiler invocation to generate a binary that can execute kernels on all supported devices, dramatically reducing both compile times and the user effort required to generate such universal binaries. Our work is publicly available as part of the hipSYCL implementation of SYCL, and supports Intel GPUs through SPIR-V, NVIDIA GPUs through CUDA PTX and AMD GPUs through ROCm amdgcn code. Our new compiler operates in two phases. At compile time, during the regular host compilation pass, it extracts the LLVM IR of kernels and stores it in a backend-independent fashion in the host binary. At runtime, the embedded LLVM IR is lowered to the format required by backend drivers (e.g. PTX, SPIR-V, amdgcn). This approach enables portability of a single code representation even if backends do not support a common code format, while still allowing interoperability with vendor-specific optimized libraries. We find that our new compiler can generate highly portable binaries that run on any NVIDIA, Intel or AMD ROCm GPU with only 20% more compilation time than a regular clang host compilation. On our test system, this is roughly 2.2× faster than compiling with the existing hipSYCL compiler for just three AMD GPUs. 
We also show that the cost of the additional runtime compilation steps is expected to be roughly comparable to the cost of the runtime compilation that backend drivers already perform today, for example to lower SPIR-V to machine code. Lastly, we present early performance results on four different GPUs from three vendors. We find that performance is usually within 10% of current multipass SYCL compiler techniques, with the maximum deviations ranging from a performance regression of 13% to a speedup of 27%. This implies that, compared to current SYCL compilation techniques, our new compiler achieves similar performance while substantially decreasing compile times and increasing the portability of generated binaries.
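
The two-phase design described in the abstract can be illustrated with a small toy model. The sketch below is not hipSYCL code and uses none of its APIs: `Backend`, `EmbeddedKernel`, `lower` and `RuntimeCompiler` are hypothetical names, a plain string stands in for LLVM IR, and the per-backend cache is an assumption of this sketch rather than something the abstract states. It only shows the shape of the idea: one backend-independent representation is embedded once, and the backend-specific format is produced at runtime for whichever device is present.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Toy model only: every name here is hypothetical and is NOT hipSYCL's API.
enum class Backend { SpirV, Ptx, Amdgcn };

// Phase 1 (compile time): the host binary carries one backend-independent
// IR blob per kernel. A plain string stands in for real LLVM IR.
struct EmbeddedKernel {
    std::string name;
    std::string generic_ir;
};

// Stand-in for a real lowering pipeline: it only tags the IR with the
// target format so that the control flow can be followed.
inline std::string lower(const EmbeddedKernel& k, Backend b) {
    switch (b) {
        case Backend::SpirV:  return "spirv(" + k.generic_ir + ")";
        case Backend::Ptx:    return "ptx(" + k.generic_ir + ")";
        case Backend::Amdgcn: return "amdgcn(" + k.generic_ir + ")";
    }
    throw std::invalid_argument("unknown backend");
}

// Phase 2 (runtime): lower the embedded IR for the backend that is actually
// requested. Caching the result is an assumption made for this sketch.
class RuntimeCompiler {
public:
    const std::string& get(const EmbeddedKernel& k, Backend b) {
        const auto key = std::make_pair(k.name, b);
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            ++lowerings_;  // count how often lowering really happens
            it = cache_.emplace(key, lower(k, b)).first;
        }
        return it->second;
    }

    std::size_t lowerings() const { return lowerings_; }

private:
    std::map<std::pair<std::string, Backend>, std::string> cache_;
    std::size_t lowerings_ = 0;
};
```

In this model, requesting the same kernel twice for `Backend::Ptx` performs one lowering, and a later request for `Backend::SpirV` performs a second one from the same embedded blob. No backend-specific code had to exist before the program started, which is the property that makes a single compiler invocation sufficient.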