International Workshop on OpenCL: Latest Publications

Embedding a DSL in SYCL for Productive and Performant Tensor Computing on Heterogeneous Devices
International Workshop on OpenCL. Pub Date: 2022-05-10. DOI: 10.1145/3529538.3529988
Abenezer Wudenhe, Hongbo Rong
Citations: 0
Exploring the possibility of a hipSYCL-based implementation of oneAPI
International Workshop on OpenCL. Pub Date: 2022-05-10. DOI: 10.1145/3529538.3530005
Aksel Alpay, Bálint Soproni, Holger Wünsche, Vincent Heuveline
Abstract: oneAPI is an open standard for a software platform built around SYCL 2020 and accelerated libraries such as oneMKL, as well as low-level building blocks such as oneAPI Level Zero. All oneAPI implementations are currently based on the DPC++ SYCL implementation. However, being able to use multiple independent SYCL implementations with oneAPI code can benefit both users and implementors when it comes to testing code or, for example, noticing ambiguities in the specification. In this work, we explore the possibility of implementing oneAPI using hipSYCL as an independent SYCL implementation instead. We review hipSYCL's design and demonstrate it running on oneAPI Level Zero with competitive performance. We also discuss hipSYCL's support for SYCL 2020 with the examples of unified shared memory (USM), group algorithms, and optional kernel lambda naming. To this end, we also contribute microbenchmarks for the SYCL 2020 group algorithms and demonstrate their performance. When testing hipSYCL with HeCBench, a large benchmark suite containing SYCL benchmarks initially developed for DPC++, we point out specification ambiguities and practices that negatively impact code portability when transitioning from DPC++ to hipSYCL. We find that we can compile 122 benchmarks with little effort with hipSYCL, and demonstrate performance for a selection of benchmarks within 20% of native models on NVIDIA and AMD GPUs. Lastly, we demonstrate oneMKL's BLAS domain running with hipSYCL on AMD and NVIDIA GPUs, and find that it can match native cuBLAS and rocBLAS performance for BLAS level 1, level 2, and level 3 operations, while significantly outperforming oneMKL with DPC++ on NVIDIA GPUs for all but the largest problem sizes.
Overall, we find that hipSYCL can efficiently support low-level building blocks like Level Zero, oneAPI libraries like oneMKL, and the SYCL 2020 programming model, and hence conclude that it is indeed possible to implement oneAPI independently of DPC++.
Citations: 9
Optimize AI pipelines with SYCL and OpenVINO
International Workshop on OpenCL. Pub Date: 2022-05-10. DOI: 10.1145/3529538.3529561
Nico Galoppo
Abstract: Sensor-data processing pipelines that "mix" feature-engineered and deep-learning-based processing have become prevalent today. For example, sensor fusion of point-cloud data with RGB image streams is common in autonomous mobile robots and self-driving technology. The state of the art in computer vision for extracting semantic information from RGB data is deep learning, and great advances have recently been made in deep-learning-based LiDAR odometry [x]. At the same time, other processing components in "mixed" pipelines still use feature-engineered approaches that do not rely on deep neural nets. Embedded compute platforms in robotics systems are inherently heterogeneous in nature, often with a variety of CPUs, (integrated) GPUs, VPUs, and so on. This means there is a growing need to implement "mixed" pipelines on heterogeneous platforms that include a variety of xPUs. We want such pipeline implementations to benefit from the latest advances in data- and thread-parallel computation, as well as the state of the art in optimized inference of AI DNN models. SYCL and OpenVINO are two open, industry-supported APIs that allow a developer to do so. It is not only important to optimize the individual components of the processing pipeline; it is at least as important to optimize the data flow and minimize data copies. This provides a way to benefit from the efficiencies in inference runtime and compute-graph optimizations provided by OpenVINO, combined with the extensibility SYCL brings in implementing custom or non-DNN components. Similarly, the use of compatible synchronization primitives allows the different runtimes to schedule work more efficiently on the hardware and avoid execution hiccups.
In this talk, we will demonstrate the mechanisms and primitives provided by both SYCL and OpenVINO to optimize the dataflow between, and efficient execution of, the workloads implemented in the respective APIs. We will provide an example and show the impact on the overall throughput and latency of the end-to-end processing pipeline. The audience will learn to recognize inefficiencies in their pipelines using profiling tools, and understand how to fix those inefficiencies using an easy-to-follow optimization recipe. Finally, we will provide guidance to developers of inference engines other than OpenVINO on how to integrate similar interoperability features into their APIs, so that they too can offer optimized SYCL-enabled AI pipelines to their users.
Citations: 0
TAU Performance System
International Workshop on OpenCL. Pub Date: 2022-05-10. DOI: 10.1145/3529538.3529557
S. Shende
Abstract: The TAU Performance System is a versatile performance evaluation tool that supports OpenCL, DPC++/SYCL, OpenMP, and other GPU runtimes. It features a performance profiling and tracing module that is widely portable and can access hardware performance-counter data at the GPU and CPU level. This talk will describe the usage and new features of TAU for performance evaluation of HPC and AI/ML workloads. TAU is integrated into the Extreme-Scale Scientific Software Stack (E4S) and is available in containerized and cloud environments. The talk/tutorial will demonstrate the usage of TAU on uninstrumented applications.
Citations: 0
OpenCLML Integration with TVM
International Workshop on OpenCL. Pub Date: 2022-05-10. DOI: 10.1145/3529538.3530003
Siva Rama Krishna Reddy, Hongqiang Wang, Alex Bourd, Adarsh Golikeri, Balaji Calidas
Citations: 1
How to optimize Compute Drivers? Let's start with writing good benchmarks!
International Workshop on OpenCL. Pub Date: 2022-05-10. DOI: 10.1145/3529538.3529569
Michał Mrozek
Abstract: Writing an efficient driver stack is the goal of every driver developer, but to see whether your stack is performant, you need tools that can confirm it. You may run workloads and benchmarks to see how your driver performs, but this only gives you a summarized score composed of many pieces. Optimizing further requires extensive work to understand the applications, figure out the bottleneck, and optimize it, which is a time-consuming process involving a lot of effort. This created a need for the driver team to write a tool that would make performance work on the driver easier, so we created compute benchmarks. In this suite we test all aspects of the driver stack to check that they have no bottlenecks. Each test checks only one thing and does so in isolation, so it is very easy to work on optimizing it, and it doesn't require any extensive setup. The benchmarks focus on subtle aspects of every driver such as: API overhead of every call, submission latencies, resource-creation costs, transfer bandwidths, multi-threaded contention, multi-process execution, and many others. The framework supports multiple backends; currently we have OpenCL and Level Zero implementations in place, so it is very easy to compare how the same scenario is serviced by different drivers. It is also very easy to compare driver implementations between vendors, as tests written in OpenCL simply work across different GPU implementations. We also use this code to present good and bad coding practices; this is very useful for showcasing how simple things can drastically improve performance, and users can run those scenarios and see how performance changes on their own setups.
It is also a great tool for prototyping new extensions and then proposing them as part of the OpenCL standard. We plan to open-source this project in Q2 2022; it is expected to be available during IWOCL.
Citations: 0
Improved address space inference for SYCL programs
International Workshop on OpenCL. Pub Date: 2022-05-10. DOI: 10.1145/3529538.3529998
Ross Brunton, V. Lomüller
Abstract: SYCL [4, 6] is a single-source, C++-based programming model for heterogeneous programming. It enables the programmer to write or port code targeting heterogeneous accelerators using what appears to the programmer as standard C++. To achieve peak performance, however, it can be necessary to write the code in a form that allows the compiler to target specific hardware features. If the compiler can target these hardware features without requiring the programmer to consider them, then productivity and application performance can both be improved. One such example is accelerators with multiple address spaces. This technical talk will describe how a SYCL compiler can infer these address spaces without requiring the programmer to specify them in the application, and will describe some required specification evolution to better cope with the new SYCL 2020 features. Hardware devices can have multiple memory regions with different levels of visibility and performance. Similar to OpenCL C [5], SYCL abstracts them into a global memory visible to all work-items, a local memory visible to a single work-group, and a private memory only visible to a single work-item. In OpenCL C, the programmer expresses address spaces using type qualifiers in order to statically encode the memory region addressed by a pointer, thus ensuring that when a programmer does specify an address space the compiler can check whether the program is well-formed. But requiring programs to be written with explicit address spaces comes at the expense of usability, as the qualifiers need to be integrated into the program design and are a barrier to integrating code not written with them in mind. Thus in OpenCL C 2.x/3.0, programmers can use the unnamed generic address space instead.
SYCL, on the other hand, does not extend the C++ language, so programmers cannot express address spaces using a type qualifier (the C++ standard does not define them). Thus in SYCL, pointers and references can be lowered to this unnamed generic address space by the device compiler. This generic address space is a virtual address space that can represent several overlapping address spaces at the same time. The memory being addressed is no longer statically known by the compiler frontend, and the SYCL implementation relies on the hardware, or software emulation, to correctly dispatch loads and stores to the correct memory. On some hardware targets this flexibility comes at a performance cost, but the cost can be avoided when the compiler can infer a single address space for a given memory access. Additionally, the low-level compute APIs that are often used as backends for a SYCL 2020 implementation do not guarantee support for a generic address space; for example, it is an optional feature in OpenCL 3.0 and non-existent in Vulkan. This means that a SYCL compiler that can infer all address spaces for a large set of programs can achieve better performance and target a wider range of backend compute APIs.
Citations: 0
SYCL Concurrency on GPU Platforms: Empirical Measurement
International Workshop on OpenCL. Pub Date: 2022-05-10. DOI: 10.1145/3529538.3529989
T. Applencourt, Abhishek Bagusetty, Ajay Panyala, Aksel Alpay
Citations: 0
FPGA Acceleration of Structured-Mesh-Based Explicit and Implicit Numerical Solvers using SYCL
International Workshop on OpenCL. Pub Date: 2022-05-10. DOI: 10.1145/3529538.3530007
Kamalavasan Kamalakkannan, G. Mudalige, I. Reguly, Suhaib A. Fahmy
Abstract: We explore the design and development of structured-mesh-based solvers on Intel FPGA hardware using the SYCL programming model. Two classes of applications are targeted: (1) stencil applications based on explicit numerical methods and (2) multi-dimensional tridiagonal solvers based on implicit methods. Both classes of solvers appear as core modules in a variety of real-world applications, ranging from computational fluid dynamics to financial computing. A general, unified workflow is formulated for synthesizing these applications on Intel FPGAs, together with predictive analytic models to explore the design space and obtain optimized performance. The performance of designs synthesized using the above techniques is benchmarked for two non-trivial applications on an Intel PAC D5005 FPGA card. Results are compared to the performance of optimized parallel implementations of the same applications on an Nvidia V100 GPU. Observed runtimes indicate that the FPGA provides comparable or improved performance relative to the V100 GPU. More importantly, however, the FPGA solutions consume 59%–76% less energy for their largest configurations. Our performance model predicts the runtime of designs with high accuracy, with less than 5% error for all cases tested, demonstrating significant utility for design-space exploration.
With these tools and techniques, we discuss the determinants for a given structured-mesh code to be amenable to FPGA implementation, providing insights into the feasibility and profitability of FPGA implementation, how to code designs using SYCL, and the resulting performance.
Citations: 1
How much SYCL does a compiler need? Experiences from the implementation of SYCL as a library for nvc++
International Workshop on OpenCL. Pub Date: 2022-05-10. DOI: 10.1145/3529538.3529556
Aksel Alpay, V. Heuveline
Citations: 1