International Workshop on OpenCL: Latest Publications

Embedding a DSL in SYCL for Productive and Performant Tensor Computing on Heterogeneous Devices
International Workshop on OpenCL. Pub Date: 2022-05-10. DOI: 10.1145/3529538.3529988
Abenezer Wudenhe, Hongbo Rong
Citations: 0
Exploring the possibility of a hipSYCL-based implementation of oneAPI
International Workshop on OpenCL. Pub Date: 2022-05-10. DOI: 10.1145/3529538.3530005
Aksel Alpay, Bálint Soproni, Holger Wünsche, Vincent Heuveline
Abstract: oneAPI is an open standard for a software platform built around SYCL 2020 and accelerated libraries such as oneMKL, as well as low-level building blocks such as oneAPI Level Zero. All oneAPI implementations are currently based on the DPC++ SYCL implementation. However, being able to use multiple independent SYCL implementations with oneAPI code can benefit both users and implementors when it comes to testing code or, for example, noticing ambiguities in the specification. In this work, we explore the possibility of implementing oneAPI using hipSYCL as an independent SYCL implementation instead. We review hipSYCL's design and demonstrate it running on oneAPI Level Zero with competitive performance. We also discuss hipSYCL's support for SYCL 2020 with the examples of unified shared memory (USM), group algorithms, and optional kernel lambda naming. To this end, we also contribute microbenchmarks for the SYCL 2020 group algorithms and demonstrate their performance. When testing hipSYCL with HeCBench, a large benchmark suite containing SYCL benchmarks initially developed for DPC++, we point out specification ambiguities and practices that negatively impact code portability when transitioning from DPC++ to hipSYCL. We find that we can compile 122 benchmarks with little effort with hipSYCL, and demonstrate performance for a selection of benchmarks within 20% of native models on NVIDIA and AMD GPUs. Lastly, we demonstrate oneMKL's BLAS domain running with hipSYCL on AMD and NVIDIA GPUs, and find that it can match native cuBLAS and rocBLAS performance for BLAS level 1, level 2, and level 3 operations, while significantly outperforming oneMKL with DPC++ on NVIDIA GPUs for all but the largest problem sizes.
Overall, we find that hipSYCL can efficiently support low-level building blocks like Level Zero, oneAPI libraries like oneMKL, and the SYCL 2020 programming model, and hence conclude that it is indeed possible to implement oneAPI independently of DPC++.
Citations: 9
Optimize AI pipelines with SYCL and OpenVINO
International Workshop on OpenCL. Pub Date: 2022-05-10. DOI: 10.1145/3529538.3529561
Nico Galoppo
Abstract: Sensor-data processing pipelines that "mix" feature-engineered and deep-learning-based processing have become prevalent today. For example, sensor fusion of point-cloud data with RGB image streams is common in autonomous mobile robots and self-driving technology. The state of the art in computer vision for extracting semantic information from RGB data is deep learning, and great advances have recently been made in deep-learning-based LiDAR odometry [x]. At the same time, other processing components in "mixed" pipelines still use feature-engineered approaches that do not rely on deep neural nets. Embedded compute platforms in robotics systems are inherently heterogeneous in nature, often with a variety of CPUs, (integrated) GPUs, VPUs, and so on. This means there is a growing need to implement "mixed" pipelines on heterogeneous platforms that include a variety of xPUs. We want such pipeline implementations to benefit from the latest advances in data- and thread-parallel computation, as well as the state of the art in optimized inference of AI DNN models. SYCL and OpenVINO are two open, industry-supported APIs that allow a developer to do so. It is not only important to optimize the individual components of the processing pipeline; it is at least as important to optimize the data flow and minimize data copies. This provides a way to benefit from the efficiencies in inference runtime and compute-graph optimizations provided by OpenVINO, combined with the extensibility SYCL brings in implementing custom or non-DNN components. Similarly, the use of compatible synchronization primitives allows the different runtimes to schedule work more efficiently on the hardware and avoid execution hiccups.
In this talk, we will demonstrate the mechanisms and primitives provided by both SYCL and OpenVINO to optimize the dataflow between, and efficient execution of, the workloads implemented in the respective APIs. We will provide an example and show the impact on the overall throughput and latency of the end-to-end processing pipeline. The audience will learn to recognize inefficiencies in their pipelines using profiling tools, and understand how to fix those inefficiencies using an easy-to-follow optimization recipe. Finally, we will provide guidance to developers of inference engines other than OpenVINO on how to integrate similar interoperability features into their APIs, so that they too can offer optimized SYCL-enabled AI pipelines to their users.
Citations: 0
TAU Performance System
International Workshop on OpenCL. Pub Date: 2022-05-10. DOI: 10.1145/3529538.3529557
S. Shende
Abstract: The TAU Performance System is a versatile performance evaluation tool that supports OpenCL, DPC++/SYCL, OpenMP, and other GPU runtimes. It features a performance profiling and tracing module that is widely portable and can access hardware performance-counter data at the GPU and CPU level. This talk will describe the usage and new features of TAU for performance evaluation of HPC and AI/ML workloads. TAU is integrated into the Extreme-Scale Scientific Software Stack (E4S) and is available in containerized and cloud environments. The talk/tutorial will demonstrate the usage of TAU on uninstrumented applications.
Citations: 0
OpenCLML Integration with TVM
International Workshop on OpenCL. Pub Date: 2022-05-10. DOI: 10.1145/3529538.3530003
Siva Rama Krishna Reddy, Hongqiang Wang, Alex Bourd, Adarsh Golikeri, Balaji Calidas
Citations: 1
How to optimize Compute Drivers? Let's start with writing good benchmarks!
International Workshop on OpenCL. Pub Date: 2022-05-10. DOI: 10.1145/3529538.3529569
Michał Mrozek
Abstract: Writing an efficient driver stack is the goal of every driver developer, but to see whether your stack is performant, you need tools that can confirm it. You may run workloads and benchmarks to see how your driver performs, but this only gives you a summarized score composed of many pieces. Optimizing further requires extensive work to understand the applications, figure out the bottleneck, and optimize it, which is a time-consuming process involving a lot of effort. This created a need for the driver team to write a tool that would make performance work on the driver easier, so we created compute benchmarks. In this suite we test all aspects of the driver stack to check that they have no bottlenecks. Each test checks only one thing and does so in isolation, so it is very easy to work on optimizing it, and it doesn't require any extensive setup. The benchmarks focus on subtle aspects of every driver such as: API overhead of every call, submission latencies, resource-creation costs, transfer bandwidths, multi-threaded contention, multi-process execution, and many others. The framework supports multiple backends; currently we have OpenCL and Level Zero implementations in place, so it is very easy to compare how the same scenario is serviced by different drivers. It is also very easy to compare driver implementations between vendors, as tests written in OpenCL simply work across different GPU implementations. We also use this code to present good and bad coding practices; this is very useful for showcasing how simple things can drastically improve performance, and users can run those scenarios and see how performance changes on their own setups.
It is also a great tool for prototyping new extensions and then proposing them as part of the OpenCL standard. We plan to open-source this project in Q2 2022; it is expected to be available during IWOCL.
Citations: 0
Improved address space inference for SYCL programs
International Workshop on OpenCL. Pub Date: 2022-05-10. DOI: 10.1145/3529538.3529998
Ross Brunton, V. Lomüller
Abstract: SYCL [4, 6] is a single-source, C++-based programming model for heterogeneous programming. It enables the programmer to write or port code targeting heterogeneous accelerators using what appears to the programmer as standard C++. To achieve peak performance, however, it can be necessary to write the code in a form that allows the compiler to target specific hardware features. If the compiler can target these hardware features without requiring the programmer to consider them, then productivity and application performance can both be improved. One such example is accelerators with multiple address spaces. This technical talk will describe how a SYCL compiler can infer these address spaces without requiring the programmer to specify them in the application, and will describe some required specification evolution to better cope with the new SYCL 2020 features. Hardware devices can have multiple memory regions with different levels of visibility and performance. Similar to OpenCL C [5], SYCL abstracts them into a global memory visible to all work-items, a local memory visible to a single work-group, and a private memory only visible to a single work-item. In OpenCL C, the programmer expresses address spaces using type qualifiers in order to statically encode the memory region addressed by a pointer, thus ensuring that when a programmer does specify an address space the compiler can check whether the program is well-formed. But requiring programs to be written with explicit address spaces comes at the expense of usability, as the qualifiers need to be integrated into the program design and are a barrier to integrating code not written with them in mind. Thus in OpenCL C 2.x/3.0, programmers can use the unnamed generic address space instead.
SYCL, on the other hand, does not extend the C++ language, so programmers cannot express address spaces using a type qualifier (the C++ standard does not define them). Thus in SYCL, pointers and references can be lowered to this unnamed generic address space by the device compiler. This generic address space is a virtual address space that can represent several overlapping address spaces at the same time. The memory being addressed is no longer statically known by the compiler frontend, and the SYCL implementation relies on the hardware, or software emulation, to correctly dispatch loads and stores to the correct memory. On some hardware targets this flexibility comes at a performance cost, but the cost can be avoided when the compiler can infer a single address space for a given memory access. Additionally, the low-level compute APIs that are often used as backends for a SYCL 2020 implementation do not guarantee support for a generic address space; for example, it is an optional feature in OpenCL 3.0 and non-existent in Vulkan. This means that a SYCL compiler that can infer all address spaces for a large set of programs can achieve better performance and target a wider range of backend compute APIs.
Citations: 0
SYCL Concurrency on GPU Platforms: Empirical Measurement
International Workshop on OpenCL. Pub Date: 2022-05-10. DOI: 10.1145/3529538.3529989
T. Applencourt, Abhishek Bagusetty, Ajay Panyala, Aksel Alpay
Citations: 0
FPGA Acceleration of Structured-Mesh-Based Explicit and Implicit Numerical Solvers using SYCL
International Workshop on OpenCL. Pub Date: 2022-05-10. DOI: 10.1145/3529538.3530007
Kamalavasan Kamalakkannan, G. Mudalige, I. Reguly, Suhaib A. Fahmy
Abstract: We explore the design and development of structured-mesh-based solvers on Intel FPGA hardware using the SYCL programming model. Two classes of applications are targeted: (1) stencil applications based on explicit numerical methods and (2) multi-dimensional tridiagonal solvers based on implicit methods. Both classes of solvers appear as core modules in a variety of real-world applications, ranging from computational fluid dynamics to financial computing. A general, unified workflow is formulated for synthesizing these applications on Intel FPGAs, together with predictive analytic models to explore the design space and obtain optimized performance. The performance of designs synthesized using the above techniques is benchmarked for two non-trivial applications on an Intel PAC D5005 FPGA card. Results are compared to the performance of optimized parallel implementations of the same applications on an Nvidia V100 GPU. Observed runtimes indicate that the FPGA provides comparable or improved performance relative to the V100 GPU. More importantly, however, the FPGA solutions consume 59%–76% less energy for their largest configurations. Our performance model predicts the runtime of designs with high accuracy, with less than 5% error for all cases tested, demonstrating significant utility for design-space exploration.
With these tools and techniques, we discuss the determinants for a given structured-mesh code to be amenable to FPGA implementation, providing insights into the feasibility and profitability of FPGA implementation, how to code designs using SYCL, and the resulting performance.
Citations: 1
How much SYCL does a compiler need? Experiences from the implementation of SYCL as a library for nvc++
International Workshop on OpenCL. Pub Date: 2022-05-10. DOI: 10.1145/3529538.3529556
Aksel Alpay, V. Heuveline
Citations: 1