International Workshop on OpenCL: Latest Publications (Page 6)

C++ for OpenCL 2021

International Workshop on OpenCL Pub Date : 2022-05-10 DOI: 10.1145/3529538.3529981

Justas Janickas, Anastasia Stulova

Citations: 0

On the Compilation Performance of Current SYCL Implementations

International Workshop on OpenCL Pub Date : 2022-05-10 DOI: 10.1145/3529538.3529548

Peter Thoman, Facundo Molina Heredia, T. Fahringer

Abstract: The Khronos SYCL abstraction layer is designed to enable programming heterogeneous platforms, consisting of host and accelerator devices, with a single-source code base. In order to allow for a high level of abstraction while still providing competitive runtime performance, both SYCL implementations and the software ecosystems built around SYCL applications frequently make heavy use of C++ templates. A potential consequence of this design choice, as well as the need to generate code for both a host and at least one device architecture, are significant compilation times. In this work we set out to study the relative compile-time performance and the impact of various SYCL features on compilation times across a selection of the most widely-used SYCL implementations. To this end, we introduce a code generator which creates SYCL kernels stressing various API features and instruction types, either in isolation or in combination, as well as an infrastructure to largely automate related experiments. We apply this infrastructure in a large-scale synthetic evaluation totaling 96000 compiler runs, which also includes a study of the compilation performance over time of the most widespread implementations. In addition to these synthetic experiments, we validate the applicability of our findings by measuring the compile times of two real-world industrial SYCL applications. On the basis of these experiments, we point out particularly impactful – in terms of compile-time performance – changes during the development of some SYCL implementations, and formulate suggestions for SYCL implementation developers as well as users. We have made both the code generator and all the tools we developed to carry out the experiments in this paper available as open source.

Citations: 2

Experiences Porting NAMD to the Data Parallel C++ Programming Model

International Workshop on OpenCL Pub Date : 2022-05-01 Epub Date: 2022-05-10 DOI: 10.1145/3529538.3529560

David J Hardy, Jaemin Choi, Wei Jiang, Emad Tajkhorshid

Citations: 0

Benchmarking a Proof-of-Concept Performance Portable SYCL-based Fast Fourier Transformation Library

International Workshop on OpenCL Pub Date : 2022-03-17 DOI: 10.1145/3529538.3529996

V. Pascuzzi, M. Goli

Citations: 1

International Workshop on OpenCL (proceedings volume)

International Workshop on OpenCL Pub Date : 2022-01-01 DOI: 10.1145/3529538

Citations: 0

Enabling the Use of C++20 Unseq Execution Policy for OpenCL

International Workshop on OpenCL Pub Date : 2021-04-27 DOI: 10.1145/3456669.3456674

Po-Yao Chang, Tai-Liang Chen, Jenq-Kuen Lee

Citations: 2

Toward a Better Defined SYCL Memory Consistency Model

International Workshop on OpenCL Pub Date : 2021-04-27 DOI: 10.1145/3456669.3456696

Ben Ashbaugh, James C. Brodman, M. Kinsner, G. Lueck, S. Pennycook, Roland Schulz

Citations: 1

Accelerating Regular-Expression Matching on FPGAs with High-Level Synthesis

International Workshop on OpenCL Pub Date : 2021-04-27 DOI: 10.1145/3456669.3456716

Devon Callanan, Luke Kljucaric, A. George

Abstract: The importance of security infrastructures for high-throughput networks has rapidly grown as a result of expanding internet traffic and increasingly high-bandwidth connections. Intrusion-detection systems (IDSs) such as SNORT rely upon rule sets designed to alert system administrators of malicious packets. Such deep-packet inspection, which depends upon regular-expression searches, can be accelerated on programmable-logic (PL) architectures using non-deterministic finite automata (NFAs). Prior designs have relied upon register-transfer level (RTL) design descriptions and achieved efficient resource utilization through fine-grained optimizations. New advances made by field-programmable gate array (FPGA) vendors have led to more powerful compiler toolchains for OpenCL that allow for rapid development on PL architectures while generating competitive designs in terms of performance. The goal of this research is to evaluate performance differences between a custom, OpenCL-based, acceleration architecture for regular expressions and comparable RTL designs. The simplicity of the application, which requires only basic hardware building blocks, adds to the novelty of the comparison. In contrast to RTL-based solutions, which show frequency degradation with bandwidth scaling, our approach is able to maintain stable and high operating frequencies at the cost of resource usage. By scaling input bandwidth with multi-character transformations, throughput in excess of 17 Gbps can be achieved on Intel’s Arria 10 Programmable Acceleration Card, outperforming similar designs with RTL as reported in the literature.

Citations: 2

On measuring the maturity of SYCL implementations by tracking historical performance improvements

International Workshop on OpenCL Pub Date : 2021-04-27 DOI: 10.1145/3456669.3456701

Wei-Chen Lin, Tom Deakin, Simon McIntosh-Smith

Abstract: SYCL is a platform agnostic, single-source, C++ based, parallel programming framework for developing platform independent software for heterogeneous systems. As an emerging framework, SYCL has been under active development for several years, with multiple implementations available from hardware vendors and others. A crucial metric for potential adopters is how mature these implementations are; are they still improving rapidly, indicating that the space is still quite immature, or has performance improvement plateaued, potentially indicating a mature market? This study presents a historical study of the performance delivered by all major SYCL implementations on a range of supported platforms. We use existing HPC-style mini-apps written in SYCL, and benchmark these on current and historical revisions of each SYCL implementation, revealing the rate of change of performance improvements over time. The data indicates that most SYCL implementations are now quite mature, showing rapid performance improvements in the past, slowing to more modest performance improvements more recently. We also compare the most recent SYCL performance to existing well established frameworks, such as OpenCL and OpenMP.

Citations: 7

Experiences Porting the SU3_Bench Microbenchmark to the Intel Arria 10 and Xilinx Alveo U280 FPGAs

International Workshop on OpenCL Pub Date : 2021-04-27 DOI: 10.1145/3456669.3456671

D. Doerfler, Farzad Fatollahi-Fard, Colin MacLean, T. Nguyen, Samuel Williams, N. Wright, Marco Siracusa

Abstract: In this study we investigate the implications of porting a common computational kernel used in high performance computing, which has been optimized for efficient execution on general purpose graphics processing units (GPUs), to a field programmable gate array (FPGA). In particular, we use a benchmark based on a matrix-matrix multiply kernel commonly used in lattice quantum chromodynamics applications. The microbenchmark is based on the OpenCL programming language. We evaluate the performance, and portability, aspects associated for two FPGAs, the Intel Arria 10 and the Xilinx Alveo U280. The purpose of the study is not to compare the two FPGAs, but to evaluate their respective OpenCL toolchains and to evaluate the level of effort needed to port a GPU optimized code to a FPGA, and the effectiveness of the respective toolchains. We did find the toolchains to be relatively easy to use, and it was possible to get correctness with little effort, but there was significant effort needed to get relatively good performance. We found that FPGAs perform best when using single work item kernels, as opposed to the nominal multiple work item NDRange kernel used for CPUs and GPUs. In addition, other source code changes were necessary, and in particular the lack of a local cache in FPGA architectures can require a significant rewrite of the code. The performance achieved with the Intel Arria 10 was 47.6% of its maximum sustained bandwidth, while the Xilinx Alveo U280 achieved 35.2%. GPU architectures have been shown to demonstrate 75% to 90% architectural efficiencies.

Citations: 0