International Workshop on OpenCL: Latest Publications (Page 9)

Toward Evaluating High-Level Synthesis Portability and Performance between Intel and Xilinx FPGAs

International Workshop on OpenCL Pub Date : 2021-04-27 DOI: 10.1145/3456669.3456699

A. Cabrera, Aaron R. Young, Jacob Lambert, Zhili Xiao, Amy An, Seyong Lee, Zheming Jin, Jungwon Kim, J. Buhler, R. Chamberlain, J. Vetter

Abstract: Offloading computation from a CPU to a hardware accelerator is becoming a more common solution for improving performance because traditional gains enabled by Moore’s law and Dennard scaling have slowed. GPUs are often used as hardware accelerators, but field-programmable gate arrays (FPGAs) are gaining traction. FPGAs are beneficial because they allow hardware specific to a particular application to be created. However, they are notoriously difficult to program. To this end, two of the main FPGA manufacturers, Intel and Xilinx, have created tools and frameworks that enable the use of higher level languages to design FPGA hardware. Although Xilinx kernels can be designed by using C/C++, both Intel and Xilinx support the use of OpenCL C to architect FPGA hardware. However, not much is known about the portability and performance between these two device families other than the fact that it is theoretically possible to synthesize a kernel meant for Intel to Xilinx and vice versa. In this work, we evaluate the portability and performance of Intel and Xilinx kernels. We use OpenCL C implementations of a subset of the Rodinia benchmarking suite that were designed for an Intel FPGA and make the necessary modifications to create synthesizable OpenCL C kernels for a Xilinx FPGA. We find that the difficulty of porting certain kernel optimizations varies, depending on the construct. Once the minimum amount of modifications is made to create synthesizable hardware for the Xilinx platform, more nontrivial work is needed to improve performance. However, we find that constructs that are known to be performant for an FPGA should improve performance regardless of the platform; the difficulty comes in deciding how to invoke certain kernel optimizations while also abiding by the constraints enforced by a given platform’s hardware compiler.

Citations: 2

Profiling Heterogeneous Computing Performance with VTune Profiler

International Workshop on OpenCL Pub Date : 2021-04-27 DOI: 10.1145/3456669.3456678

V. Tsymbal, Alexandr Kurylev

Abstract: Programming heterogeneous platforms requires a deep understanding of system architecture at all levels, which helps application designers choose the best data and work decomposition between the CPU and accelerating hardware such as GPUs. However, in many cases applications are being converted from a conventional CPU programming language like C++, or from accelerator-friendly but still low-level languages like OpenCL, and the main problem is to determine which parts of the application benefit from being offloaded to the GPU. Another problem is to estimate how much performance one might gain from acceleration on a particular GPGPU device. Each platform has unique limitations that affect the performance of offloaded computing tasks, e.g., data transfer tax, task initialization overhead, and memory latency and bandwidth limitations. To take those constraints into account, software developers need tooling that collects the right information and produces recommendations for making the best design and optimization decisions. In this presentation we will introduce two new GPU performance analysis types in Intel® VTune™ Profiler, and a methodology for profiling the performance of heterogeneous applications that the analyses support. VTune Profiler is a well-known tool for performance characterization on CPUs; it now includes GPU Offload Analysis and GPU Hotspots Analysis for applications written in most offloading models, including OpenCL, SYCL/Data Parallel C++, and OpenMP Offload. The GPU Offload analysis helps identify how the CPU interacts with the GPU(s) by creating and submitting tasks to offload queues. It provides metrics and performance data such as GPU Utilization, Hottest GPU Computing Tasks, task instance count and timing, kernel Data Transfer Size, SIMD Width measurements, GPU Execution Unit (EU) thread occupancy, and Memory Utilization. Together, the metrics give a systematic picture of how effectively tasks were offloaded and executed on GPUs. The GPU Hotspots analysis examines how efficiently computing tasks or kernels run on GPU EUs and interact with the GPU memory subsystem. Inefficiencies caused by compute kernel implementation or compiler issues result in idle EUs or increased latency when fetching data from memory sources to EU registers, which eventually leads to performance degradation. Because of the complexity of the GPU memory subsystem (L1 and L2 caches, Shared Local Memory, L3 cache, GPU DRAM, CPU LLC and DRAM), analyzing data access inefficiencies is even more difficult. The GPU Hotspots analysis addresses those problems by presenting a visualization of the current GPU Memory Hierarchy Diagram, detailed tracing of data transfers between memory agents, memory bandwidth measurements, and barrier and atomics analysis. In addition, VTune analyzes each compute kernel at the source level, providing performance metrics against source lines or assembly instructions.

Citations: 1

International Workshop on OpenCL

International Workshop on OpenCL Pub Date : 2021-01-01 DOI: 10.1145/3456669

Citations: 1

IWOCL '20: International Workshop on OpenCL, Virtual Event / Munich, Germany, April 27-29, 2020

International Workshop on OpenCL Pub Date : 2020-01-01 DOI: 10.1145/3388333

Citations: 0

Gzip on a chip: high performance lossless data compression on FPGAs using OpenCL

International Workshop on OpenCL Pub Date : 2014-05-12 DOI: 10.1145/2664666.2664670

M. Abdelfattah, A. Hagiescu, Deshanand P. Singh

Citations: 114

Evaluation of a performance portable lattice Boltzmann code using OpenCL

International Workshop on OpenCL Pub Date : 2014-05-12 DOI: 10.1145/2664666.2664668

Simon McIntosh-Smith, Dan Curran

Abstract: With the advent of many-core computer architectures such as GPGPUs from NVIDIA and AMD, and more recently Intel's Xeon Phi, ensuring performance portability of HPC codes is potentially becoming more complex. In this work we have focused on one important application area, structured grid codes, and investigated techniques exploiting OpenCL to enable performance portability across a diverse range of high-end many-core architectures. In particular we have chosen to investigate 3D lattice Boltzmann codes (D3Q19 BGK). We have developed an OpenCL version of this code in order to provide cross-platform functional portability, and compared the performance of this OpenCL version to optimized native versions on each target platform, including hybrid OpenMP/AVX versions on CPUs and Xeon Phi, and CUDA versions on NVIDIA GPUs. Results show that, contrary to conventional wisdom, using OpenCL it is possible to achieve a high degree of performance portability, at least for 3D lattice Boltzmann codes, using a set of straightforward techniques. The performance portable code in OpenCL is also highly competitive with the best performance using the native parallel programming models on each platform.

Pages: 2:1-2:12

Citations: 17

Generating OpenCL C kernels from OpenACC

International Workshop on OpenCL Pub Date : 2014-05-12 DOI: 10.1145/2664666.2664675

T. Vanderbruggen, John Cavazos

Citations: 4

KernelInterceptor: automating GPU kernel verification by intercepting kernels and their parameters

International Workshop on OpenCL Pub Date : 2014-05-12 DOI: 10.1145/2664666.2664673

E. Bardsley, A. Donaldson, John Wickerson

Citations: 6

clMAGMA: high performance dense linear algebra with OpenCL

International Workshop on OpenCL Pub Date : 2014-05-12 DOI: 10.1145/2664666.2664667

Chongxiao Cao, J. Dongarra, Peng Du, M. Gates, P. Luszczek, S. Tomov

Citations: 32

Performance portability study of linear algebra kernels in OpenCL

International Workshop on OpenCL Pub Date : 2014-05-12 DOI: 10.1145/2664666.2664674

K. Rupp, Philippe Tillet, F. Rudolf, J. Weinbub, T. Grasser, A. Jüngel

Citations: 7