International Workshop on OpenCL: Recent Publications

Toward Evaluating High-Level Synthesis Portability and Performance between Intel and Xilinx FPGAs
International Workshop on OpenCL. Pub Date: 2021-04-27. DOI: 10.1145/3456669.3456699
A. Cabrera, Aaron R. Young, Jacob Lambert, Zhili Xiao, Amy An, Seyong Lee, Zheming Jin, Jungwon Kim, J. Buhler, R. Chamberlain, J. Vetter
Abstract: Offloading computation from a CPU to a hardware accelerator is becoming a more common solution for improving performance because traditional gains enabled by Moore's law and Dennard scaling have slowed. GPUs are often used as hardware accelerators, but field-programmable gate arrays (FPGAs) are gaining traction. FPGAs are beneficial because they allow hardware specific to a particular application to be created. However, they are notoriously difficult to program. To this end, two of the main FPGA manufacturers, Intel and Xilinx, have created tools and frameworks that enable the use of higher-level languages to design FPGA hardware. Although Xilinx kernels can be designed by using C/C++, both Intel and Xilinx support the use of OpenCL C to architect FPGA hardware. However, not much is known about the portability and performance between these two device families other than the fact that it is theoretically possible to synthesize a kernel meant for Intel to Xilinx and vice versa. In this work, we evaluate the portability and performance of Intel and Xilinx kernels. We use OpenCL C implementations of a subset of the Rodinia benchmarking suite that were designed for an Intel FPGA and make the necessary modifications to create synthesizable OpenCL C kernels for a Xilinx FPGA. We find that the difficulty of porting certain kernel optimizations varies, depending on the construct. Once the minimum amount of modifications is made to create synthesizable hardware for the Xilinx platform, more nontrivial work is needed to improve performance. However, we find that constructs that are known to be performant for an FPGA should improve performance regardless of the platform; the difficulty comes in deciding how to invoke certain kernel optimizations while also abiding by the constraints enforced by a given platform's hardware compiler.
Citations: 2
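The porting friction the abstract describes often shows up in vendor-specific kernel annotations. As a sketch (my illustration, not code from the paper), the same unrolling request can be written with an Intel-style pragma or with the standard OpenCL C 2.0 attribute accepted by the Xilinx toolchain; the kernel names and the factor 4 are arbitrary:

```c
// Intel FPGA SDK for OpenCL: loop unrolling requested via a pragma.
__kernel void scale_intel(__global const float *in, __global float *out) {
    #pragma unroll 4
    for (int i = 0; i < 16; i++)
        out[i] = 2.0f * in[i];
}

// Xilinx Vitis/SDAccel: the same request via the OpenCL C 2.0 attribute.
__kernel void scale_xilinx(__global const float *in, __global float *out) {
    __attribute__((opencl_unroll_hint(4)))
    for (int i = 0; i < 16; i++)
        out[i] = 2.0f * in[i];
}
```

Both compilers accept plain OpenCL C, so a baseline kernel usually ports directly; the nontrivial effort the authors mention goes into re-expressing hints like these, and memory-layout attributes, within each hardware compiler's constraints.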
Profiling Heterogeneous Computing Performance with VTune Profiler
International Workshop on OpenCL. Pub Date: 2021-04-27. DOI: 10.1145/3456669.3456678
V. Tsymbal, Alexandr Kurylev
Abstract: Programming heterogeneous platforms requires a deep understanding of the system architecture at all levels, which helps application designers choose the best data and work decomposition between the CPU and accelerating hardware such as GPUs. However, in many cases applications are converted from a conventional CPU programming language like C++, or from accelerator-friendly but still low-level languages like OpenCL, and the main problem is to determine which parts of the application would benefit from being offloaded to the GPU. Another problem is estimating how much performance one might gain by accelerating on a particular GPGPU device. Each platform has unique limitations that affect the performance of offloaded computing tasks, e.g. data-transfer overhead, task-initialization overhead, and memory latency and bandwidth limits. To take these constraints into account, software developers need tooling that collects the right information and produces recommendations for making the best design and optimization decisions. In this presentation we introduce two new GPU performance analysis types in Intel® VTune™ Profiler, and a methodology for profiling heterogeneous application performance supported by these analyses. VTune Profiler, a well-known tool for performance characterization on CPUs, now includes GPU Offload Analysis and GPU Hotspots Analysis for applications written in most offloading models, including OpenCL, SYCL/Data Parallel C++, and OpenMP Offload. The GPU Offload analysis helps identify how the CPU interacts with GPU(s) by creating and submitting tasks to offload queues. It provides metrics and performance data such as GPU utilization, hottest GPU computing tasks, task instance counts and timing, kernel data-transfer size, SIMD width measurements, GPU Execution Unit (EU) thread occupancy, and memory utilization. Together, these metrics give a systematic picture of how effectively tasks were offloaded to and executed on GPUs. The GPU Hotspots analysis examines the efficiency of computing tasks, or kernels, running on GPU EUs and interacting with the GPU memory subsystem. Inefficiencies caused by kernel implementation or compiler issues result in idle EUs or increased latency when fetching data from memory sources into EU registers, eventually degrading performance. Because of the complexity of the GPU memory subsystem (L1 and L2 caches, shared local memory, L3 cache, GPU DRAM, CPU LLC and DRAM), analyzing data-access inefficiencies is even harder. The GPU Hotspots analysis addresses these problems with a visualization of the current GPU memory-hierarchy diagram, detailed tracing of data transfers between memory agents, memory bandwidth measurements, and analysis of barriers and atomics. In addition, VTune analyzes each compute kernel at the source level, reporting performance metrics per source line or assembly instruction.
Citations: 1
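The two analyses described above are driven from VTune's command line. The invocations below use the documented analysis names, but the target binary and result directory are placeholders:

```shell
# GPU Offload analysis: how the CPU creates and submits tasks to offload queues.
vtune -collect gpu-offload -- ./my_app

# GPU Hotspots analysis: kernel efficiency and memory-subsystem behavior on the GPU.
vtune -collect gpu-hotspots -- ./my_app

# Summarize a collected result.
vtune -report summary -r <result-dir>
```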
International Workshop on OpenCL
International Workshop on OpenCL. Pub Date: 2021-01-01. DOI: 10.1145/3456669
Citations: 1
IWOCL '20: International Workshop on OpenCL, Virtual Event / Munich, Germany, April 27-29, 2020
International Workshop on OpenCL. Pub Date: 2020-01-01. DOI: 10.1145/3388333
Citations: 0
Gzip on a chip: high performance lossless data compression on FPGAs using OpenCL
International Workshop on OpenCL. Pub Date: 2014-05-12. DOI: 10.1145/2664666.2664670
M. Abdelfattah, A. Hagiescu, Deshanand P. Singh
Abstract: Hardware implementation of lossless data compression is important for optimizing the capacity/cost/power of storage devices in data centers, as well as communication channels in high-speed networks. In this work we use the Open Computing Language (OpenCL) to implement high-speed data compression (Gzip) on field-programmable gate arrays (FPGAs). We show how we make use of a heavily pipelined custom hardware implementation to achieve a high throughput of ~3 GB/s with more than a 2x compression ratio over standard compression benchmarks. When compared against a highly tuned CPU implementation, the performance-per-watt of our OpenCL FPGA implementation is 12x better and the compression ratio is on par. Additionally, we compare our implementation to a hand-coded commercial implementation of Gzip to quantify the gap between a high-level language like OpenCL and a hardware description language like Verilog. OpenCL performance is 5.3% lower than Verilog, and the area cost is 2% more logic and 25% more of the FPGA's available memory resources, but the productivity gains are significant.
Citations: 114
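At the heart of Gzip's LZ77 stage is a longest-match search over previously seen bytes, the loop that the paper's heavily pipelined hardware evaluates for many candidate positions per clock cycle. A plain sequential C sketch of that core (my simplification for illustration, not the paper's pipelined design):

```c
#include <assert.h>
#include <stddef.h>

/* Naive LZ77 longest-match search: find the longest run starting at `pos`
 * that also occurs at some earlier position in `buf`. Overlapping matches
 * are allowed, as in Gzip. Returns the match length; writes the match
 * start to *match_pos when a match is found. */
static size_t longest_match(const unsigned char *buf, size_t pos, size_t len,
                            size_t *match_pos) {
    size_t best_len = 0;
    for (size_t cand = 0; cand < pos; cand++) {
        size_t l = 0;
        while (pos + l < len && buf[cand + l] == buf[pos + l])
            l++;
        if (l > best_len) {
            best_len = l;
            *match_pos = cand;
        }
    }
    return best_len;
}
```

The FPGA design reaches ~3 GB/s by unrolling this search in space: dedicated comparators examine many candidate positions concurrently each cycle instead of iterating over them.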
Evaluation of a performance portable lattice Boltzmann code using OpenCL
International Workshop on OpenCL. Pub Date: 2014-05-12. DOI: 10.1145/2664666.2664668
Simon McIntosh-Smith, Dan Curran
Abstract: With the advent of many-core computer architectures such as GPGPUs from NVIDIA and AMD, and more recently Intel's Xeon Phi, ensuring performance portability of HPC codes is potentially becoming more complex. In this work we have focused on one important application area, structured grid codes, and investigated techniques exploiting OpenCL to enable performance portability across a diverse range of high-end many-core architectures. In particular we have chosen to investigate 3D lattice Boltzmann codes (D3Q19 BGK). We have developed an OpenCL version of this code in order to provide cross-platform functional portability, and compared the performance of this OpenCL version to optimized native versions on each target platform, including hybrid OpenMP/AVX versions on CPUs and Xeon Phi, and CUDA versions on NVIDIA GPUs. Results show that, contrary to conventional wisdom, using OpenCL it is possible to achieve a high degree of performance portability, at least for 3D lattice Boltzmann codes, using a set of straightforward techniques. The performance portable code in OpenCL is also highly competitive with the best performance using the native parallel programming models on each platform.
Citations: 17
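The D3Q19 BGK scheme the paper benchmarks relaxes per-cell particle distributions toward a local equilibrium. A minimal C sketch of that collision step, reduced to a D2Q9 lattice purely to keep it short (an assumption for illustration; the paper uses the 3D, 19-velocity variant):

```c
#include <assert.h>
#include <math.h>

/* D2Q9 lattice weights and discrete velocities. */
static const double w[9]  = {4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9,
                             1.0/36, 1.0/36, 1.0/36, 1.0/36};
static const int    cx[9] = {0, 1, 0, -1,  0, 1, -1, -1,  1};
static const int    cy[9] = {0, 0, 1,  0, -1, 1,  1, -1, -1};

/* Single-cell BGK collision: relax the distributions f toward the local
 * second-order equilibrium at rate 1/tau. Mass and momentum are conserved. */
static void bgk_collide(double f[9], double tau) {
    double rho = 0.0, ux = 0.0, uy = 0.0;
    for (int i = 0; i < 9; i++) {       /* moments: density and velocity */
        rho += f[i];
        ux  += f[i] * cx[i];
        uy  += f[i] * cy[i];
    }
    ux /= rho;
    uy /= rho;
    double usq = ux * ux + uy * uy;
    for (int i = 0; i < 9; i++) {       /* relax toward equilibrium */
        double cu  = cx[i] * ux + cy[i] * uy;
        double feq = w[i] * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq);
        f[i] += (feq - f[i]) / tau;
    }
}
```

The collision applies independently to every lattice cell, which is why it maps cleanly onto GPUs, Xeon Phi, and CPUs alike; the performance-portability effort in such codes goes mostly into data layout and the neighboring-cell streaming step.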
Generating OpenCL C kernels from OpenACC
International Workshop on OpenCL. Pub Date: 2014-05-12. DOI: 10.1145/2664666.2664675
T. Vanderbruggen, John Cavazos
Abstract: Hardware accelerators are now a common way to improve the performance of compute nodes. This performance improvement has a cost: applications need to be rewritten to take advantage of the new hardware. OpenACC is a set of compiler directives to target hardware accelerators with minimal modification of the original application. In this paper, we present the generation of OpenCL C kernels from OpenACC-annotated codes. We introduce a method to produce multiple kernels for each OpenACC compute region. We evaluate these kernels on different hardware accelerators (NVIDIA GPU, Intel MIC). Finally, we show that the produced kernels give different performance on different accelerators. Hence this method produces a tuning space in which we can search for the best kernel version for a given accelerator.
Citations: 4
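The kind of compute region the paper starts from, and the shape of kernel a generator could emit, can be sketched as follows. `saxpy` is my example, not one of the paper's benchmarks; with a non-OpenACC compiler the pragma is simply ignored, so the C code still runs serially:

```c
#include <assert.h>

/* An OpenACC compute region: one annotated loop that a translator would
 * turn into an OpenCL C kernel plus host-side launch code. */
void saxpy(int n, float a, const float *x, float *y) {
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* One OpenCL C kernel a generator could plausibly emit for the region,
 * with the loop index mapped to the global work-item id (an illustration,
 * not the paper's actual output):
 *
 *   __kernel void saxpy_acc(int n, float a,
 *                           __global const float *x, __global float *y) {
 *       int i = get_global_id(0);
 *       if (i < n) y[i] = a * x[i] + y[i];
 *   }
 */
```

Producing several such kernels per region, e.g. with different work-item mappings or work-group sizes, is what creates the tuning space the authors then search per accelerator.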
KernelInterceptor: automating GPU kernel verification by intercepting kernels and their parameters
International Workshop on OpenCL. Pub Date: 2014-05-12. DOI: 10.1145/2664666.2664673
E. Bardsley, A. Donaldson, John Wickerson
Abstract: GPUVerify is a static analysis tool for verifying that GPU kernels are free from data races and barrier divergence. It is intended as an automatic tool, but its usability is impaired by the fact that the user must explicitly supply the kernel source code, the number of work items and work groups, and preconditions on key kernel arguments. Extracting this information from non-trivial OpenCL applications is laborious and error-prone. We describe an extension to GPUVerify, called KernelInterceptor, that automates the extraction of this information from a given OpenCL application. After recompiling the application having included an additional header file, and linking with an additional library, KernelInterceptor is able to detect each dynamic kernel launch and record the values of the various parameters in a series of log files. GPUVerify can then be invoked to examine these log files and verify each kernel instance. We explain how the interception mechanism works, and comment on the extent to which it improves the usability of GPUVerify.
Citations: 6
clMAGMA: high performance dense linear algebra with OpenCL
International Workshop on OpenCL. Pub Date: 2014-05-12. DOI: 10.1145/2664666.2664667
Chongxiao Cao, J. Dongarra, Peng Du, M. Gates, P. Luszczek, S. Tomov
Abstract: This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms in OpenCL. In particular, these are linear system solvers and eigenvalue problem solvers. Further, we give an overview of the clMAGMA library, an open source, high performance OpenCL library that incorporates various optimizations, and in general provides the DLA functionality of the popular LAPACK library on heterogeneous architectures. The LAPACK compliance and use of OpenCL simplify the use of clMAGMA in applications, while providing them with portable performance. High performance is obtained through the use of the high-performance OpenCL BLAS, hardware- and OpenCL-specific tuning, and a hybridization methodology, where we split the algorithm into computational tasks of various granularities. Execution of those tasks is efficiently scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components.
Citations: 32
Performance portability study of linear algebra kernels in OpenCL
International Workshop on OpenCL. Pub Date: 2014-05-12. DOI: 10.1145/2664666.2664674
K. Rupp, Philippe Tillet, F. Rudolf, J. Weinbub, T. Grasser, A. Jüngel
Abstract: The performance portability of OpenCL kernel implementations for common memory bandwidth limited linear algebra operations across different hardware generations of the same vendor as well as across vendors is studied. Certain combinations of kernel implementations and work sizes are found to exhibit good performance across compute kernels, hardware generations, and, to a lesser degree, vendors. As a consequence, it is demonstrated that the optimization of a single kernel is often sufficient to obtain good performance for a large class of more complicated operations.
Citations: 7