Proceedings of the 2023 International Workshop on OpenCL最新文献_第3页

Machine Learning for Vectorization Decision in OpenCL/SYCL Kernel OpenCL/SYCL内核中矢量化决策的机器学习

Proceedings of the 2023 International Workshop on OpenCL Pub Date : 2023-04-18 DOI: 10.1145/3585341.3585364

Wenju He, Yuxin Zou, Feng Zou

引用次数: 0

What’s New in SYCL for Safety Critical Systems 安全关键系统的SYCL新功能

Proceedings of the 2023 International Workshop on OpenCL Pub Date : 2023-04-18 DOI: 10.1145/3585341.3585367

Erik Tomusk, Verena Beckham

{"title":"What’s New in SYCL for Safety Critical Systems","authors":"Erik Tomusk, Verena Beckham","doi":"10.1145/3585341.3585367","DOIUrl":"https://doi.org/10.1145/3585341.3585367","url":null,"abstract":"In April 2022, Codeplay and CoreAVI initiated the SYCL SC Exploratory Forum within Khronos to evaluate industry interest in a new Khronos API based on SYCL and targeted at safety-critical industries[1]. A year later, we take stock of the progress the Exploratory Forum has made on defining SYCL for Safety-Critical Systems, and we share some of the insights we have gained. Safety-critical industries, like avionics, automotive, nuclear, and rail, require their software to be compliant to safety standards such as ISO 26262, ISO 21448/SOTIF, DO-178C, and UL4600, as well as to adhere to guidelines such as those defined by AUTOSAR and MISRA. While safety-critical industries have traditionally been cautious about adopting new or unproven technologies, interest by these industries in C++ and heterogeneous programming has increased significantly in recent years. This is driven, in large part, by the need for AI technologies to implement advanced features, such as autonomous behavior. Compute-heavy workloads like AI require high-level programming frameworks as well as considerable computing power, which can only be achieved by a heterogeneous system design. SYCL’s single-source C++ programming model has already become popular in the HPC industry. The proposed SYCL for Safety-Critical Systems API aims to open up high-level heterogeneous compute to safety-critical industries by introducing modifications and extensions to SYCL to make both SYCL applications and SYCL implementations easier to certify to industry safety standards. In this talk, we will give an overview of what certification to a safety standard implies for a compiler and runtime based on SYCL. Khronos Exploratory Forums are designed to be open to companies and individuals who are not yet Khronos participants. A key aim of the SYCL SC Exploratory Forum was to hear from “end-user” companies in safety-critical domains, and to evaluate the market for a safety-critical API based on SYCL. The talk will give an overview of the companies that participated and their general feedback. In the initial phase, the SYCL SC Exploratory Forum heard presentations from its participants and collated a “wish list” of features for a high-level heterogeneous compute API. The talk will give an overview of features that were requested and a discussion of some of the more interesting points. In the second stage, the members of the Forum analyzed these “wishes” according to their relevance to a safety-critical standard specifically based on SYCL. A list of core requirements for the SYCL for Safety-Critical Systems API was distilled from the wish list and will act as a guide during the definition of the new standard. The talk will include an overview of the requirements, background on the finer technical points, and some of the technical discussions that were had around these topics. The presentation will also describe some of the open questions that are still to be answered during the design of the SYCL for Safety-Critica","PeriodicalId":360830,"journal":{"name":"Proceedings of the 2023 International Workshop on OpenCL","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116897012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Standardizing complex numbers in SYCL SYCL中复数的标准化

Proceedings of the 2023 International Workshop on OpenCL Pub Date : 2023-04-18 DOI: 10.1145/3585341.3585343

T. Applencourt, B. Videau, Jefferson Le Quellec, Amanda Dufek, K. Harms, N. Liber, Bryce Allen, Aiden Belton-Schure

引用次数: 1

Streamline Ahead-of-Time SYCL CPU Device Implementation through Bypassing SPIR-V 通过绕过SPIR-V，简化提前SYCL CPU设备实现

Proceedings of the 2023 International Workshop on OpenCL Pub Date : 2023-04-18 DOI: 10.1145/3585341.3585381

Wenju He, Yilong Guo, Xinmin Tian, Hideki Saito, Wenwan Xing, Feng Zou, Chunyang Dai, Maosu Zhao, Haonan Yang

引用次数: 0

Porting SYCL accelerated neural network frameworks to edge devices 将SYCL加速的神经网络框架移植到边缘设备

Proceedings of the 2023 International Workshop on OpenCL Pub Date : 2023-04-18 DOI: 10.1145/3585341.3585346

Dylan Angus, S. Georgiev, Hector Arroyo Gonzalez, J. Riordan, P. Keir, M. Goli

{"title":"Porting SYCL accelerated neural network frameworks to edge devices","authors":"Dylan Angus, S. Georgiev, Hector Arroyo Gonzalez, J. Riordan, P. Keir, M. Goli","doi":"10.1145/3585341.3585346","DOIUrl":"https://doi.org/10.1145/3585341.3585346","url":null,"abstract":"Portable hardware acceleration has become increasingly necessary with the rise of the popularity of edge computing. Edge computing, referring to the distributed computing paradigm that encourages data to be processed and stored as close to the source of origination as possible, is needed in areas where bandwidth and latency are restricted and network stability, privacy, or security are unreliable or insecure. Examples of such situations are autonomous mobile robotics, such as autonomous tractors, which often have numerous cameras connected to the host, all needing processing in areas where there can be no reliable connection to a cloud-based platform. Additionally, bridge surveying drones, where mapping and path-planning are needed with low latency, can benefit from a lightweight, compact, low-powered device, especially when there are size and energy consumption requirements. Thus, edge devices, which work as small but compact computers, leverage onboard accelerators to tackle various Robotics, Computer Vision and AI tasks directly on the device without needing an external connection. These accelerators often take the popular form of a GPU like Nvidia’s Jetson development kit series, which are driven by the same workflows of Nvidia’s AI software and cloud-native frameworks while staying lean, compact and less energy-demanding. However, with the increasing popularity of FPGAs, in the future we could see more edge devices like AMD and Xilinx’s KR260 robotics development kit, that operate at low power. Hence, with the surge of the usefulness of edge devices and variety in the brand and type of accelerators, the need for hardware portability in edge devices expands as well. Thus, as we will show in this talk, SYCL as an open-standard, high-level parallel programming model which provides portability not only at the API level but also at the compiler level provides this hardware portability by enabling the same software to be run on both CPU, GPU and FPGA-based edge devices. Additionally, we will show how we maintain performance through device-specific kernel specialisation. The Open Neural Network Exchange (ONNX) is an open-source artificial intelligence ecosystem of technology companies and research organizations that establish open standards for representing machine learning algorithms and software tools. ONNX is available on GitHub. This presentation will explain how we used DPC++, an open source SYCL implementation, to compile the SYCL backend of the ONNX runtime, to target NVIDIA’s Jetson series architecture. DPC++ allows us to compile for the ONNX runtime SYCL backend and use the Jetson’s onboard GPU and also use ComputeAorta, Codeplay’s multi-target, multi-platform framework, as an OpenCL implementation to target the Jetson’s onboard CPU. We will show the performance we get using the ONNX runtime CPU backend and the SYCL backend targeting Jetson’s GPU and CPU. The ONNX runtime SYCL backend is implemented using the lightweight templated SYCL-BLA","PeriodicalId":360830,"journal":{"name":"Proceedings of the 2023 International Workshop on OpenCL","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116487539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Comparing the Performance of SYCL Runtimes for Molecular Dynamics Applications 分子动力学应用中SYCL运行时的性能比较

Proceedings of the 2023 International Workshop on OpenCL Pub Date : 2023-04-18 DOI: 10.1145/3585341.3585350

Andrey Alekseenko, Szilárd Páll

{"title":"Comparing the Performance of SYCL Runtimes for Molecular Dynamics Applications","authors":"Andrey Alekseenko, Szilárd Páll","doi":"10.1145/3585341.3585350","DOIUrl":"https://doi.org/10.1145/3585341.3585350","url":null,"abstract":"SYCL is a cross-platform, royalty-free standard for programming a wide range of hardware accelerators. It is a powerful and convenient way to write standard C++ 17 code that can take full advantage of available devices. There are already multiple SYCL implementations targeting a wide range of platforms, from embedded to HPC clusters. Since several implementations can target the same hardware, application developers and users must know how to choose the most fitting runtime for their needs. In this talk, we will compare the runtime performance of two major SYCL runtimes targeting GPUs, oneAPI DPC++ and Open SYCL [3], to the native implementations for the purposes of GROMACS, a high-performance molecular dynamics engine. Molecular dynamics (MD) applications were one of the earliest adopters of GPU acceleration, with force calculations being an obvious target for offloading. It is an iterative algorithm where, in its most basic form, on each step, forces acting between particles are computed, and then the equations of motions are integrated. As the computational power of the GPUs grew, the strong scaling problem became apparent: the biophysical systems modeled with molecular dynamics typically have fixed sizes, and the goal is to perform more time steps, each taking less than a millisecond of wall time. This places high demands on the underlying GPU framework, requiring it to efficiently schedule multiple small tasks with minimal overhead, allowing to achieve overlap between CPU and GPU work for large systems and allowing to keep GPU occupied for smaller systems. Another requirement is the ability of application developers to have control over the scheduling to optimize for external dependencies, such as MPI communication. GROMACS is a widely-used MD engine, supporting a wide range of hardware and software platforms, from laptops to the largest supercomputers [1]. Portability and performance across multiple architectures have always been one of the primary goals of the project, necessary to keep the code not only efficient but also maintainable. The initial support for NVIDIA accelerators, using CUDA, was added to GROMACS in 2010. Since then, heterogeneous parallelization has been a major target for performance optimization, not limited to NVIDIA devices but later adding support for GPUs of other vendors, as well as Xeon Phi accelerators. GROMACS initially adopted SYCL in its 2021 release to replace its previous GPU portability layer, OpenCL [2]. In further releases, the number of offloading modes supported by the SYCL backend steadily increased. As of GROMACS 2023, SYCL support in GROMACS achieved near feature parity with CUDA while allowing the use of a single code to target the GPUs of all three major vendors with minimal specialization. While this clearly supports the portability promise of modern SYCL implementations, the performance of such portable code remains an open question, especially given the strict requirements of MD algorithms. In th","PeriodicalId":360830,"journal":{"name":"Proceedings of the 2023 International Workshop on OpenCL","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129390275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Parallel Algorithm for a Hidden Markov Model with an Indefinite Number of States and Heterogeneous Observation Data 具有不定状态数和异构观测数据的隐马尔可夫模型并行算法

Proceedings of the 2023 International Workshop on OpenCL Pub Date : 2023-04-18 DOI: 10.1145/3585341.3587954

V. Roubtsova

引用次数: 0

Experiences Migrating CUDA to SYCL: A Molecular Docking Case Study CUDA向SYCL迁移的经验:一个分子对接案例研究

Proceedings of the 2023 International Workshop on OpenCL Pub Date : 2023-04-18 DOI: 10.1145/3585341.3585372

Leonardo Solis-Vasquez, E. Mascarenhas, Andreas Koch

引用次数: 4

Technical Talk: A SYCL Extension for User-Driven Online Kernel Fusion 技术讲座:用户驱动在线内核融合的SYCL扩展

Proceedings of the 2023 International Workshop on OpenCL Pub Date : 2023-04-18 DOI: 10.1145/3585341.3585377

Victor Perez, Lukas Sommer, Victor Lomüler, Kumudha Narasimhan, M. Goli

{"title":"Technical Talk: A SYCL Extension for User-Driven Online Kernel Fusion","authors":"Victor Perez, Lukas Sommer, Victor Lomüler, Kumudha Narasimhan, M. Goli","doi":"10.1145/3585341.3585377","DOIUrl":"https://doi.org/10.1145/3585341.3585377","url":null,"abstract":"Heterogeneous programming models such as SYCL allow developers to integrate a variety of accelerators found in today’s heterogeneous systems into an application with ease. However, while offloading specific tasks to specialized accelerators can deliver significant performance improvements for many applications, short-running device kernels remain a challenge for most heterogeneous programming models. Each invocation of a device kernel is linked to some overhead, caused by the necessary data-transfers, kernel launch and synchronization between host and device. In particular, for a sequence of short-running kernels, this can lead to an unfavourable ratio of overhead and actual computation, resulting in performance degradation. One potential solution to address this problem is to merge multiple small, memory-bound, short-running kernels into a single larger kernel. This leads to better use of the device’s resources and amortizes the device launch overhead. Yet, manually creating fused kernels can be an error-prone, challenging task for developers, and the resulting kernels are less reusable and maintainable. The extension to the SYCL API presented in this talk aims to automate the creation of fused kernels. It provides a mechanism for users or software frameworks using SYCL to instruct the runtime to automatically fuse multiple device kernels at runtime, without the need for manual implementation of the fused kernel. Users or software frameworks can use their application and domain knowledge, as well as runtime context information, to determine when fusion of kernels is legal and profitable, while the actual process of creating a fused kernel is automated by the SYCL runtime. Reducing the kernel launch overhead is however not the only way kernel fusion can improve application performance. The LLVM-based JIT compiler integrated into the SYCL runtime implementation for automatic creation of fused kernels can perform further optimizations. One such optimization is the internalization of dataflow. Intermediate results that originally needed to be communicated via global memory between the different kernels now become internal dataflow of the fused kernel. Replacing slow global memory accesses for this internalized dataflow with faster accesses to local memory or even registers can yield significant performance improvements for many applications. The extension presented in this talk is currently an experimental vendor extension, targeting SYCL version 2020. The initial proof-of-concept implementation was based on Codeplay’s ComputeCpp SYCL implementation and has also been contributed and open-sourced as part of the DPC++ SYCL implementation. To demonstrate the performance improvements unlocked by the extension, two different types of workloads are evaluated on Intel CPU and integrated Intel GPUs. For a set of sixteen typical operator sequences from neural networks with various input sizes, kernel fusion achieves speedups between 0.9x and 2.26x on GPU (ge","PeriodicalId":360830,"journal":{"name":"Proceedings of the 2023 International Workshop on OpenCL","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122313418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

SYCLomatic compatibility library: making migration to SYCL easier SYCL兼容性库:使迁移到SYCL更容易

Proceedings of the 2023 International Workshop on OpenCL Pub Date : 2023-04-18 DOI: 10.1145/3585341.3585349

Andy Huang

{"title":"SYCLomatic compatibility library: making migration to SYCL easier","authors":"Andy Huang","doi":"10.1145/3585341.3585349","DOIUrl":"https://doi.org/10.1145/3585341.3585349","url":null,"abstract":"SYCL[1] is a royalty-free, cross-platform abstraction C++ programming model for heterogeneous computing. SYCL provides necessary programming interfaces like device, queue, kernel, memory interface including buffer, accessor as well as features like USM. As a programing model for heterogeneous computing, Intel oneAPI[2] provides a SYCL compiler and runtime to support SYCL kernel-based programing and set of optimized libraries to support API-based programming. SYCLomatic[3] is a project to assist developers in migrating their existing code written in different programming languages to the SYCL C++ heterogeneous programming model. SYCLomatic supports source-to-source migration from existing CUDA application source code to SYCL source code by leveraging SYCL interfaces and the optimized libraries provided by Intel oneAPI. One of the major challenges of SYCLomatic is that, in some cases, due to differences in API, expressing the identical semantic of a single line of CUDA code in SYCL requires additional data structures or multiple lines of operations. To assist the migration and make the migrated code performant and maintainable, SYCLomatic implements a compatibility library, which consists of additions to SYCL interfaces and a set of compatible APIs for popular libraries. Without the dependency to SYCLomatic, the compatibility library can be used as a standalone library for SYCL programming. In this talk, we are going to share the reason of creating the compatibility library and the design of the compatibility library. Addressing Semantic Differences: The first part of the compatibility library is to address the semantic differences with CUDA code by adding new functionality to SYCL interfaces like device, queue, malloc, image accessor, etc. by introducing new classes. (1) Utility features to access queues in different devices and threads: Keeping and passing around the sycl::device pointer between host functions is tedious. In the compatibility library, a singleton device manager class is introduced and used to track the usage of each device in different CPU threads. With the device manager class, it is easy to achieve following features: (a) Get the “current” device in a thread: The class keeps a map between threads and the last used device in the thread. The map makes it easier to access the wanted device in a host function. (b) Get the default queue for a device: When offloading a task to a device, SYCL requires developer to create a new queue on the device if the pointer of previous created queue is not available. The class keeps a default queue for each device which will be available globally. When a developer needs to use the queue on a device, the class provides a convenient interface to get the default queue of the device. (c) Device level operation (create queue, synchronize, reset): The class records all the creation of queues and maps the queues to the devices. Therefore, device level synchronization can be achieved easily. (2) Pointer-l","PeriodicalId":360830,"journal":{"name":"Proceedings of the 2023 International Workshop on OpenCL","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127582503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2