Proceedings of the 2023 International Workshop on OpenCL: Latest Publications

Evaluation of SYCL’s Suitability for High-Performance Critical Systems
Proceedings of the 2023 International Workshop on OpenCL. Pub Date: 2023-04-18. DOI: 10.1145/3585341.3585378
Cristina Quesada Peralta, Matina Maria Trompouki, Leonidas Kosmidis
{"title":"Evaluation of SYCL’s Suitability for High-Performance Critical Systems","authors":"Cristina Quesada Peralta, Matina Maria Trompouki, Leonidas Kosmidis","doi":"10.1145/3585341.3585378","DOIUrl":"https://doi.org/10.1145/3585341.3585378","url":null,"abstract":"Upcoming safety critical systems require high performance processing, which can be provided by multi-cores and embedded GPUs found in several Systems-on-chip (SoC) targeting these domains. So far, only low-level programming models and APIs, such as CUDA or OpenCL have been evaluated. In this paper, we evaluate the effectiveness of a higher level programming model, SYCL, for critical applications executed in such embedded platforms. In particular, we are interested in two aspects: performance and programmability. In order to conduct our study, we use the open source GPU4S Bench benchmarking suite for space and an open source pedestrian detection application representing the automotive sector, which we port into SYCL and analyse their behavior. We perform our evaluation on a high-performance platform featuring an NVIDIA GTX 1080Ti as well as a representative embedded platform, the NVIDIA Xavier AGX which is considered a good candidate for future safety critical systems in both domains and we compare our results with other programming models. Our results show that in several cases SYCL is able to obtain performance close to highly optimised code using CUDA or NVIDIA libraries, with significantly lower development effort and complexity, which confirms the suitability of SYCL for programming high-performance safety critical systems.","PeriodicalId":360830,"journal":{"name":"Proceedings of the 2023 International Workshop on OpenCL","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127538382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Edge Acceleration for Machine Learning-based Motion Artifact Detection on fNIRS Dataset
Proceedings of the 2023 International Workshop on OpenCL. Pub Date: 2023-04-18. DOI: 10.1145/3585341.3585380
Yunyi Zhao, Yunjia Xia, Rui Loureiro, Hubin Zhao, Uwe Dolinsky, Shufan Yang
{"title":"Edge Acceleration for Machine Learning-based Motion Artifact Detection on fNIRS Dataset","authors":"Yunyi Zhao, Yunjia Xia, Rui Loureiro, Hubin Zhao, Uwe Dolinsky, Shufan Yang","doi":"10.1145/3585341.3585380","DOIUrl":"https://doi.org/10.1145/3585341.3585380","url":null,"abstract":"","PeriodicalId":360830,"journal":{"name":"Proceedings of the 2023 International Workshop on OpenCL","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129982923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Accelerating Simulink/Matlab projects with SYCL
Proceedings of the 2023 International Workshop on OpenCL. Pub Date: 2023-04-18. DOI: 10.1145/3585341.3585363
Uwe Dolinsky
{"title":"Accelerating Simulink/Matlab projects with SYCL","authors":"Uwe Dolinsky","doi":"10.1145/3585341.3585363","DOIUrl":"https://doi.org/10.1145/3585341.3585363","url":null,"abstract":"Matlab/Simulink is a very popular software development tool which is widely used in academia and industry to build and evaluate complex dynamical systems. It combines graphical modelling with the ability to develop algorithms directly in the Matlab language and offers various toolboxes targeting a wide range of applications in automotive, robotics, image processing, machine learning and other areas. For deployment on various platforms the Matlab/Simulink projects are typically translated into C/C++ by MathWorks’ Simulink/Matlab Coder tool and subsequently built by the C/C++ toolchain for the targeted platform. In this talk we present a new tool flow to accelerate Simulink/Matlab projects with SYCL. This enables Matlab/Simulink projects to take advantage of the growing open-source SYCL ecosystem to accelerate complex Simulink models on a wide range of diverse platforms in a standards-based way. This enables Matlab/Simulink projects to directly benefit from performance-optimized SYCL algorithms and tools to tune the performance of Simulink models on different hardware. The presented tool flow does not require Matlab/Simulink to be installed and no Matlab/Simulink dependencies are required. The approach is non-disruptive in that the Simulink/Matlab developers do not need to know SYCL and do not need to adapt Simulink/Matlab solutions to take advantage of SYCL. In this work Simulink/Matlab solutions are translated with the help of open-source tools into C++ code calling an API which can be customized to use different algorithms and libraries as backends. For example, vector/matrix operations are typically performed in this flow by the Eigen library while scalar operations are performed by the standard C++ library. API functions for specific Matlab/Simulink operations or operands can be implemented or overloaded/specialized to use different libraries if needed. For example operations on large matrices or vectors can be implemented using SYCL-BLAS to take advantage of highly parallel linear algebra operations that can be autotuned to maximize performance for a given platform. The presented tool flow converts entire Simulink solutions into C++ code. It does so by reading in Simulink solutions files (in.slx or.mdl format) and extracting the models and their associated sub models and blocks. It also reads in and integrates data files (*.mat files), data dictionary files, and Matlab files to initialize the workspace. The blocks are then scheduled and translated into C++ code constituting the model step which is run to execute the model. Models can contain sub models, model references and sub systems – which can contain embedded Matlab files that are executed when the associated blocks are executed. These Matlab files are converted into C++ code and integrated with the code generated from the containing Simulink block. The tool flow solved various challenges when converting Matlab code into C++, and supports most Matlab language features. 
The presented tool","PeriodicalId":360830,"journal":{"name":"Proceedings of the 2023 International Workshop on OpenCL","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116462311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
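As a rough illustration of the backend-customizable API described above (all names hypothetical, not from the actual tool flow), generated model-step code could call a thin wrapper whose implementation is free to dispatch to Eigen by default, or to a SYCL-accelerated library for large operands, without the generated code changing:

    #include <Eigen/Dense>

    // Hypothetical generated-code API; the real tool flow's interface is
    // not public, so names and structure here are illustrative only.
    namespace model_api {

    // Default backend: Eigen performs vector/matrix operations. This
    // function could instead be specialized to call a SYCL BLAS gemm for
    // large matrices, without touching the generated model-step code.
    inline Eigen::MatrixXf matmul(const Eigen::MatrixXf& a,
                                  const Eigen::MatrixXf& b) {
      return a * b;
    }

    }  // namespace model_api

    // Generated model step (illustrative): calls only the API, never a
    // specific backend library directly.
    void model_step(const Eigen::MatrixXf& u, Eigen::MatrixXf& y,
                    const Eigen::MatrixXf& gain) {
      y = model_api::matmul(gain, u);
    }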
Citations: 0
Performance Evolution of Different SYCL Implementations based on the Parallel Least Squares Support Vector Machine Library
Proceedings of the 2023 International Workshop on OpenCL. Pub Date: 2023-04-18. DOI: 10.1145/3585341.3585369
Marcel Breyer, Alexander Van Craen, D. Pflüger
{"title":"Performance Evolution of Different SYCL Implementations based on the Parallel Least Squares Support Vector Machine Library","authors":"Marcel Breyer, Alexander Van Craen, D. Pflüger","doi":"10.1145/3585341.3585369","DOIUrl":"https://doi.org/10.1145/3585341.3585369","url":null,"abstract":"In machine learning and scientific computing, some of the biggest challenges are efficient and performant portable computing. With our Parallel Least Squares Support Vector Machine (PLSSVM) library, we have not only developed an unrivaled Support Vector Machine (SVM) implementation for huge dense data sets, but we have also created a representative benchmark for a frequently encountered task in scientific computing, a (implicit) matrix-vector multiplication. PLSSVM supports multiple backends—OpenMP, CUDA, HIP, OpenCL, and SYCL—to be able to target the most widely used hardware platforms in machine learning and scientific computing. In this paper, we use PLSSVM to compare different DPC++ and Open SYCL (formerly known as hipSYCL) versions over the period of one year. Furthermore, we compared two versions (one from February and the other from November 2022) with each other and report their respective performance evolution in depth. We also put these results in relation to our other implemented backends and report their performance portability on three different hardware platforms, an NVIDIA and AMD GPU and an Intel CPU. Our results show that installing new DPC++ and Open SYCL versions can have surprisingly vast impacts in both directions. In our case, the nd_range kernel runtimes were up to faster on an NVIDIA GPU when using a newer DPC++ compiler. Also for Open SYCL, using the new omp.accelerated compilation flow improves the nd_range performance on CPUs by over . When compared to OpenCL, in our results, SYCL also offers a better performance portability while being easier to use, indicated by drastically fewer lines of code needed in our PLSSVM library. While OpenCL only has a performance portability of , DPC++ achieved the highest value with within the performance metric provided by Pennycook et al. [23]. The code, utility scripts, and documentation are all publicly available on GitHub: https://github.com/SC-SGS/PLSSVM.","PeriodicalId":360830,"journal":{"name":"Proceedings of the 2023 International Workshop on OpenCL","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128480494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
VkFFT and beyond - a platform for runtime GPU code generation
Proceedings of the 2023 International Workshop on OpenCL. Pub Date: 2023-04-18. DOI: 10.1145/3585341.3585357
Dmitrii Tolmachev
{"title":"VkFFT and beyond - a platform for runtime GPU code generation","authors":"Dmitrii Tolmachev","doi":"10.1145/3585341.3585357","DOIUrl":"https://doi.org/10.1145/3585341.3585357","url":null,"abstract":"This talk will present the VkFFT version 1.3 and the new platform for runtime GPU code generation it is based on. The main reason for this update is to make algorithms implemented in VkFFT available for many other GPU applications and standardize the way the code is generated in it. The platform presented allows fine-tuning of the algorithms for a particular GPU and API they are executed on at runtime. It aims to make it easier for competent GPU programmers to express themselves to different APIs, as the design logic of modern GPUs is fairly similar between all vendors. This is the main difference between the platform and other existing API-independent ways to write code, as they usually aim at fast prototyping and simple optimizations under the hood for beginner-level GPU programmers. The platform has a hierarchical structure design: Application -> Plan -> Code. At the application stage, the platform performs all interactions with the user and resources management. This includes configuration parsing, calls to the application initialization, update, dispatch and deletion with optional binary caching. The plan stage is the internal configuration stage that constructs the intermediate representation of the problem to be solved. This includes all algorithm decision-making, resource allocation, calls to the code generator and code compilation. The code generation stage produces a string that will hold GPU code for a particular API that can be later compiled and used. It is further divided into multiple levels: level 2 subkernels – a clear description of the problem via a sequence of calls to lower levels; level 1 subkernels – simple routines: matrix-vector multiplication, FFT, pre- and post-processing, R2C/R2R mappings; level 0 subkernels – memory management, basic math, functions inlining, API-dependent definitions. The code generator operates on special data containers, that can hold either known during the plan creation integer/float values or strings of variable names. Using a multiplication operation that performs A=B*C as an example, if all containers have known values, A can be precomputed during plan creation. If A, B and C are register names, we print to the kernel an operation of multiplication to be executed. This talk will also discuss multiple algorithms implemented with this platform. On the example of VkFFT we will demonstrate the overall platform structure and the general GPU application design guidelines, mainly related to optimization of memory layout, such as having no CPU-GPU transfers during execution except for asynchronous downloads from the GPU, minimization of GPU dedicated memory-L2-L1 communication and maximization of on-chip memory usage. 
To go even further, we will demonstrate how a finite difference solver can be implemented with a help of the platform using only low-level warp shuffling instructions to perform on-chip data transfers instead of using the shared memory of the streaming multiprocessor (on-chip memory acce","PeriodicalId":360830,"journal":{"name":"Proceedings of the 2023 International Workshop on OpenCL","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122825324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
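A minimal sketch of the container mechanism described above, assuming a much-simplified design (the actual VkFFT containers are more elaborate): a container holds either a value known at plan time or a kernel register name, and multiplication either constant-folds at plan creation or prints code into the kernel string.

    #include <string>
    #include <variant>

    // Simplified sketch; names and structure are illustrative only.
    struct Container {
      std::variant<double, std::string> v;  // known value or register name
      bool known() const { return std::holds_alternative<double>(v); }
    };

    // A = B * C: fold the product at plan time if both operands are known,
    // otherwise emit a multiplication into the kernel source string.
    Container mul(const Container& b, const Container& c, std::string& kernel) {
      if (b.known() && c.known())
        return {std::get<double>(b.v) * std::get<double>(c.v)};
      auto name = [](const Container& x) {
        return x.known() ? std::to_string(std::get<double>(x.v))
                         : std::get<std::string>(x.v);
      };
      static int tmp = 0;
      std::string a = "t" + std::to_string(tmp++);
      kernel += "float " + a + " = " + name(b) + " * " + name(c) + ";\n";
      return {a};
    }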
Citations: 0
Towards Deferred Execution of a SYCL Command Graph
Proceedings of the 2023 International Workshop on OpenCL. Pub Date: 2023-04-18. DOI: 10.1145/3585341.3585375
Ewan W. Crawford, Pablo Reble, Ben Tracy, Julian Miller
{"title":"Towards Deferred Execution of a SYCL Command Graph","authors":"Ewan W. Crawford, Pablo Reble, Ben Tracy, Julian Miller","doi":"10.1145/3585341.3585375","DOIUrl":"https://doi.org/10.1145/3585341.3585375","url":null,"abstract":"A key concept in SYCL’s execution model is the use of command groups that create a directed acyclic graph of kernel executions at runtime. A command group object defines a set of dependencies or edges that must be satisfied for kernels or nodes to be executed. However, because command group submission is tied to execution on the queue, without having a prior construction step before starting execution, optimization opportunities can be missed from the runtime not being made aware of a defined dependency graph ahead of execution. This represents de facto a built-in eager execution mode in SYCL in contrast to a lazy execution mode where definition and submission of work is decoupled. We propose an extension to the SYCL 2020 specification [6], which closes this gap by introducing the concept of a command graph. We add new mechanisms for the user to build a command graph for later execution. Commands are added to a graph, finalized to prepare for execution, and finally executed on a queue. The extension decouples overhead associated with submission by performing expensive operations and optimizations at finalize time and allowing for batching of commands at submission time. This command batching is supported by many SYCL backends but not exposed to users through the SYCL API. In addition to the benefits to the SYCL runtime, there are also advantages to the user developing SYCL applications. Repetitive workloads no longer must redundantly issue the same sequence of commands. Instead, a graph is only constructed once and submitted for execution as many times as is necessary, only changing the data in input buffers or USM (Unified Shared Memory) allocations. For applications from specific domains, such as machine learning as well as computer vision, where the same command group pattern is run repeatedly for different inputs, this is particularly useful. This talk is presented in two sections. First, we provide an overview of the specification for the extension. This includes two distinct mechanisms for graph building: An explicit API that provides a new set of functions for expressing a command graph directly in SYCL code, and the “Record & Replay” API that is designed to retrofit existing codebases and enable the use of existing libraries and frameworks with minor modifications. We discuss the mechanisms available for modifying a graph after construction and the motivation for the API design compared to other similar mechanisms in use today in other programming models. In the second section of our talk, we detail the work in progress for implementing the extension in Intel’s DPC++ runtime, in particular the early-stage prototype [3]. We will show execution traces demonstrating the potential overhead reduction that is possible, as well as current limitations, and what we’ve learned from implementing it so far. 
This includes an overview of how our implementation maps to the various backends available and how to address situations where there is no backen","PeriodicalId":360830,"journal":{"name":"Proceedings of the 2023 International Workshop on OpenCL","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134238192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
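A sketch of how the “Record & Replay” mechanism could look in user code, based on the extension proposal as presented; class and member names follow the proposal at the time and may differ in later revisions of the extension.

    #include <sycl/sycl.hpp>
    namespace sycl_ext = sycl::ext::oneapi::experimental;

    // in/out are assumed to be device USM allocations.
    void run_repeatedly(sycl::queue& q, float* in, float* out, size_t n) {
      sycl_ext::command_graph graph(q.get_context(), q.get_device());

      graph.begin_recording(q);  // subsequent submissions become graph nodes
      q.parallel_for(sycl::range<1>(n),
                     [=](sycl::id<1> i) { out[i] = 2.0f * in[i]; });
      graph.end_recording(q);

      // Expensive preparation and optimization happen once, at finalize time.
      auto exec_graph = graph.finalize();

      // Replay: repeated submission avoids re-issuing the command sequence.
      for (int iter = 0; iter < 100; ++iter)
        q.ext_oneapi_graph(exec_graph);
      q.wait();
    }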
Citations: 0
Leveraging MLIR for Better SYCL Compilation (Poster)
Proceedings of the 2023 International Workshop on OpenCL. Pub Date: 2023-04-18. DOI: 10.1145/3585341.3585379
Victor Perez, Ettore Tiotto, Whitney Tsang, Arnamoy Bhattacharyya, Lukas Sommer, V. Lomüller, Jefferson Le Quellec, James C. Brodman
{"title":"Leveraging MLIR for Better SYCL Compilation (Poster)","authors":"Victor Perez, Ettore Tiotto, Whitney Tsang, Arnamoy Bhattacharyya, Lukas Sommer, V. Lomüller, Jefferson Le Quellec, James C. Brodman","doi":"10.1145/3585341.3585379","DOIUrl":"https://doi.org/10.1145/3585341.3585379","url":null,"abstract":"Recent years have raised awareness of the fact that many optimizing C++ compilers, such as Clang/LLVM, miss optimization opportunities due to the lack of a suitable high-level intermediate representation. The typical compilation flow of such a compiler would lower from a representation close to the original or pre-processed source code, e.g., an abstract syntax tree (AST), directly to a low-level, CFG- and SSA-based intermediate representation such as LLVM IR. However, this lowering loses much of the high-level information and structure of the original source code as it cannot always be represented accurately in the low-level intermediate representation. Compiler optimization passes working on this low-level IR try to recover relevant parts of the high-level information (e.g., loops) and programmer’s intent to optimize the code. If they fail to recover the necessary information, important optimization opportunities might be missed. This insight about loss of high-level information in compilers has driven the creation of the MLIR framework. The MLIR framework, through its core abstraction called “dialect”, enables the creation of a set of multiple intermediate representations capturing high-level semantics and domain-specific information. The progressive lowering process from source code to executable then happens in much smaller steps and allows optimization passes to operate at the appropriate level of abstraction to leverage high-level information. With SYCL being strongly based on C++, SYCL compilers likewise suffer from the same problem as current C++ compilers, which naturally raises the question of whether the MLIR framework can be used to improve SYCL compilation. Our poster will present important insights from our ongoing investigation into this question. It will present an overview of an architecture for an MLIR-based SYCL compiler, demonstrating how MLIR can be integrated into the typical compilation flow for SYCL applications and how the resulting compilation output interacts with existing SYCL runtime implementations. The poster will also report on the status of an ongoing collaboration project between Codeplay and Intel, developing an MLIR-based SYCL compiler as an open-source project based on Intel’s existing DPC++ SYCL compiler and runtime implementation. At the time of writing, the MLIR-based device compiler can already compile a substantial portion of the SYCL application tests in Intel’s fork of the LLVM test-suite, and we seek to further improve coverage and extend the compilation to the host-part of SYCL applications. The design principles and core abstractions of the MLIR dialect for SYCL, developed as part of the project, will be discussed in detail, demonstrating how MLIR enables compiler optimization passes to better understand the semantics of SYCL applications. 
The poster will outline several opportunities for MLIR to significantly improve the code generated for SYCL applications over existing, LLVM-based compilation flo","PeriodicalId":360830,"journal":{"name":"Proceedings of the 2023 International Workshop on OpenCL","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114968752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Implementation Techniques for SPMD Kernels on CPUs
Proceedings of the 2023 International Workshop on OpenCL. Pub Date: 2023-04-18. DOI: 10.1145/3585341.3585342
Joachim Meyer, Aksel Alpay, Sebastian Hack, H. Fröning, Vincent Heuveline
{"title":"Implementation Techniques for SPMD Kernels on CPUs","authors":"Joachim Meyer, Aksel Alpay, Sebastian Hack, H. Fröning, Vincent Heuveline","doi":"10.1145/3585341.3585342","DOIUrl":"https://doi.org/10.1145/3585341.3585342","url":null,"abstract":"More and more frameworks and simulations are developed using heterogeneous programming models such as OpenCL, SYCL, CUDA, or HIP. A significant hurdle to mapping these models to CPUs in a performance-portable manner is that implementing work-group barriers for such kernels requires providing forward-progress guarantees so that all work-items can reach the barrier. This work provides guidance for implementations of single-program multiple-data (SPMD) programming models, such as OpenCL, SYCL, CUDA, or HIP, on non-SPMD devices, such as CPUs. We discuss the trade-offs of multiple approaches to handling work-group-level barriers. We present our experience with the integration of two known compiler-based approaches for low-overhead work-group synchronization on CPUs. Thereby we discuss a general design flaw in deep loop fission approaches, as used in the popular Portable Computing Language (PoCL) project, that makes them miscompile certain kernels. For our evaluation, we integrate PoCL’s “loopvec” kernel compiler into hipSYCL and implement continuation-based synchronization (CBS) in the same. We compare both against hipSYCL’s library-only fiber implementation using diverse hardware: we use recent AMD Rome and Intel Icelake server CPUs but also two Arm server CPUs, namely Fujitsu’s A64FX and Marvell’s ThunderX2. We show that compiler-based approaches outperform library-only implementations by up to multiple orders of magnitude. Further, we adapt our CBS implementation into PoCL and compare it against its loopvec approach in both, PoCL and hipSYCL. We find that our implementation of CBS, while being more general than PoCL’s approach, gives comparable performance in PoCL and even surpasses it in hipSYCL. Therefore we recommend its use in general.","PeriodicalId":360830,"journal":{"name":"Proceedings of the 2023 International Workshop on OpenCL","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116820334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Particle track reconstruction on heterogeneous platforms with SYCL
Proceedings of the 2023 International Workshop on OpenCL. Pub Date: 2023-04-18. DOI: 10.1145/3585341.3585344
Bartosz Sobol, G. Korcyl
{"title":"Particle track reconstruction on heterogeneous platforms with SYCL","authors":"Bartosz Sobol, G. Korcyl","doi":"10.1145/3585341.3585344","DOIUrl":"https://doi.org/10.1145/3585341.3585344","url":null,"abstract":"With the SYCL programming model comes the promise of relatively easy parallel and accelerated code development as well as out-of-the-box portability between various hardware platforms from different vendors. One of the areas which can highly benefit from this kind of characteristics of the programming model is particle physics experiments, where large amounts of data need to be processed on multiple stages by a wide variety of algorithms of different profiles. Such a data processing pipeline is often required to consume streaming data from the detectors in an online manner. Modern hardware platforms, accelerators, and their increasing performance are an opportunity for collaborations to collect and analyze more data, more effectively and with better accuracy. On the other hand, building a complex software stack by teams with a limited number of developers becomes more and more challenging in a multi-vendor landscape and with new programming models and APIs emerging. As the physics experiments are designed and computing solutions evaluated many years ahead of the actual run, there is also a need for the codebase of this kind of scientific software to be future-proof, e.g., being able to run on a next-generation computing cluster that uses GPU accelerators from different vendors or entirely different platforms like upcoming powerful APU devices. In this project, we begin with a simple single-threaded implementation of particle track reconstruction algorithm proposed for one of the subdetectors in the PANDA experiment being under development as a part of the FAIR Facility at GSI, Darmstadt, Garmany. We start with a task to port the algorithm to SYCL with minimal effort, I.e., trying to keep the kernel code as close to the original implementation as possible, while attempting to maintain good parallelization and competitive performance in an accelerated environment. After many iterations, experimentation with different memory layouts as well as various approaches to express parallelism and data flow to tame the memory-bounded characteristics of the algorithm, we came up with a final version, that’s still similar in terms of code structure to the original implementation and can achieve satisfying performance across all kinds of different targets. This ultimate implementation, comprising 7 kernels and multiple auxiliary accelerated functions, was evaluated using major SYCL implementations: hipSYCL and DPC++. Benchmarks were conducted on a wide variety of platforms from leading vendors including NVIDIA V100, NVIDIA A100, and AMD MI250 GPUs as well as AMD EPYC Rome and Intel Cascade Lake CPUs, and finally AMD/Xilinx Alveo U280 FPGA accelerator card. For the latter, an experimental AMD/Xilinx compiler based on Intel’s LLVM version was used. We also compare the performance with CUDA implementation built in the same manner as the final SYCL one, showing that it can achieve performance comparable to the native version. 
We show that developing performant and ","PeriodicalId":360830,"journal":{"name":"Proceedings of the 2023 International Workshop on OpenCL","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128976856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
One Pass to Bind Them: The First Single-Pass SYCL Compiler with Unified Code Representation Across Backends
Proceedings of the 2023 International Workshop on OpenCL. Pub Date: 2023-04-18. DOI: 10.1145/3585341.3585351
Aksel Alpay, Vincent Heuveline
{"title":"One Pass to Bind Them: The First Single-Pass SYCL Compiler with Unified Code Representation Across Backends","authors":"Aksel Alpay, Vincent Heuveline","doi":"10.1145/3585341.3585351","DOIUrl":"https://doi.org/10.1145/3585341.3585351","url":null,"abstract":"Current SYCL implementations rely on multiple compiler invocations to generate code for host and device, and typically even employ one compiler invocation per required backend code format such as SPIR-V, PTX or amdgcn. This makes generating “universal” binaries that can run on all devices supported by a SYCL implementation very time-consuming, or outright impractical. The ability to generate such universal binaries is however important e.g. when a software vendor wishes to distribute binaries to users that rely on unknown hardware configurations. To address this issue, we present the very first SYCL implementation with a single-source, single compiler pass (SSCP) design and a unified code representation across backends. This allows a single compiler invocation to generate a binary that can execute kernels on all supported devices, dramatically reducing both compile times as well as the user effort required to generate such universal binaries. Our work is publicly available as part of the hipSYCL implementation of SYCL, and supports Intel GPUs through SPIR-V, NVIDIA GPUs through CUDA PTX and AMD GPUs through ROCm amdgcn code. Our new compiler operates in two phases: At compile time, during the regular host compilation pass, it extracts the LLVM IR of kernels. This IR is then stored in a backend-independent fashion in the host binary. At runtime, the embedded LLVM IR is then lowered to the format required by backend drivers (e.g. PTX, SPIR-V, amdgcn). This approach enables portability of a single code representation even if backends do not support a common code format, while still allowing interoperability with vendor-specific optimized libraries. We find that our new compiler can generate highly portable binaries that run on any NVIDIA, Intel or AMD ROCm GPU with only 20% additional compilation time compared to a regular clang host compilation. On our test system, this is roughly 2.2 × faster than compiling with the existing hipSYCL compiler for just three AMD GPUs. We also show that the cost of the additional runtime compilation steps can be expected to be approximately comparable to the cost of runtime compilation that backend drivers already perform today, e.g. to lower SPIR-V to machine code. Lastly, we present early performance results on four different GPUs from three vendors. We find that performance is usually within 10% of current multipass SYCL compiler techniques, with the maximum deviations ranging from a performance regression of 13% to a speedup of 27%. 
This implies that compared to current SYCL compilation techniques, our new compiler achieves similar performance while substantially decreasing compile times, and increasing the portability of generated binaries.","PeriodicalId":360830,"journal":{"name":"Proceedings of the 2023 International Workshop on OpenCL","volume":"1224 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125083790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
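The described two-phase flow can be summarized in a schematic sketch; all function names here are hypothetical stand-ins, not hipSYCL internals:

    #include <vector>

    enum class Backend { CudaPtx, SpirV, RocmAmdgcn };

    // Stubs standing in for the real IR-to-backend translation passes.
    std::vector<char> to_ptx(const std::vector<char>& ir)    { return ir; }
    std::vector<char> to_spirv(const std::vector<char>& ir)  { return ir; }
    std::vector<char> to_amdgcn(const std::vector<char>& ir) { return ir; }

    // Runtime phase: the kernel LLVM IR that the single host compilation
    // pass embedded into the binary is lowered to whatever format the
    // backend driver expects, only when a device is actually selected.
    std::vector<char> lower_embedded_kernel(const std::vector<char>& llvm_ir,
                                            Backend target) {
      switch (target) {
        case Backend::CudaPtx:    return to_ptx(llvm_ir);     // NVIDIA
        case Backend::SpirV:      return to_spirv(llvm_ir);   // Intel
        case Backend::RocmAmdgcn: return to_amdgcn(llvm_ir);  // AMD
      }
      return {};
    }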
Citations: 3