Profiling Heterogeneous Computing Performance with VTune Profiler

International Workshop on OpenCL Pub Date : 2021-04-27 DOI:10.1145/3456669.3456678

V. Tsymbal, Alexandr Kurylev


Citations: 1

Abstract

Programming heterogeneous platforms requires a deep understanding of system architecture at all levels, which helps application designers choose the best decomposition of data and work between the CPU and accelerating hardware such as GPUs. In many cases, however, applications are being converted from a conventional CPU programming language like C++, or from accelerator-friendly but still low-level languages like OpenCL, and the main problem is to determine which parts of the application would benefit from being offloaded to the GPU. Another problem is to estimate how much performance one might gain from acceleration on a particular GPGPU device. Each platform has unique limitations that affect the performance of offloaded computing tasks, e.g. data transfer tax, task initialization overhead, and memory latency and bandwidth limitations. To take those constraints into account, software developers need tooling that collects the right information and produces recommendations for making the best design and optimization decisions.

In this presentation we introduce two new GPU performance analysis types in Intel® VTune™ Profiler, along with a methodology for profiling heterogeneous applications that the analyses support. VTune Profiler is a well-known tool for performance characterization on CPUs. It now includes GPU Offload analysis and GPU Hotspots analysis for applications written with most offload programming models, including OpenCL, SYCL/Data Parallel C++, and OpenMP Offload.

The GPU Offload analysis helps identify how the CPU interacts with the GPU(s) by creating and submitting tasks to offload queues. It provides metrics and performance data such as GPU Utilization, Hottest GPU Computing Tasks, task instance count and timing, kernel Data Transfer Size, SIMD Width measurements, GPU Execution Unit (EU) thread occupancy, and Memory Utilization. Taken together, the metrics give a systematic picture of how effectively tasks were offloaded and executed on GPUs.

The GPU Hotspots analysis examines the efficiency of computing tasks, or kernels, as they run on GPU EUs and interact with the GPU memory subsystem. Inefficiencies caused by the compute kernel implementation or by compiler issues result in idle EUs or increased latency when fetching data from memory sources into EU registers, which eventually degrades performance. The complexity of the GPU memory subsystem (L1, L2 Caches, Shared Local Memory, L3 Cache, GPU DRAM, CPU LLC and DRAM) makes analyzing data access inefficiencies even more difficult. The GPU Hotspots analysis addresses those problems by presenting a GPU Memory Hierarchy Diagram for the current GPU, detailed tracing of data transfers between memory agents, memory bandwidth measurements, and analysis of barriers and atomics. In addition, VTune analyzes each compute kernel at the source level, attributing performance metrics to source lines or assembly instructions. Memory Latency metrics help determine the most inefficient data accesses at the source-line level. A supplementary GPU Instruction Count analysis shows the mix of instructions that the compiler generated for a kernel.

The GPU analyses in VTune are well developed for the OpenCL language and run-time. The more recent SYCL language and its extension Data Parallel C++, along with the Level Zero run-time, are supported as well, running on all Intel GPUs from Gen9 HD Graphics to Intel Iris Xe Graphics (a discrete GPU card). Results of performance profiling on different GPU architectures will be presented in the session. VTune Profiler for GPUs is a newly extended toolset that is being actively developed alongside new acceleration architectures at Intel. New features and analysis concepts appear in the tool continually to meet the needs of software architects and developers.
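The offload trade-off named in the abstract (data transfer tax plus task initialization overhead, weighed against faster kernel execution) can be illustrated with a back-of-envelope model. The sketch below is not VTune's methodology. The class `OffloadCandidate`, the function names, and all numeric values are hypothetical, and the model assumes that transfers, launch overhead, and kernel execution do not overlap.

```python
from dataclasses import dataclass


@dataclass
class OffloadCandidate:
    """Hypothetical description of one code region considered for offload.

    All times are per invocation, in seconds.
    """
    cpu_time_s: float         # measured time of the region on the CPU
    gpu_kernel_time_s: float  # assumed or measured kernel time on the GPU
    bytes_transferred: int    # host<->device traffic per invocation


def estimated_offload_time(candidate: OffloadCandidate,
                           link_bandwidth_bytes_per_s: float,
                           launch_overhead_s: float) -> float:
    """Estimate the time of one offloaded invocation.

    Assumes no overlap: launch overhead, data transfer, and kernel
    execution are simply added together.
    """
    transfer_s = candidate.bytes_transferred / link_bandwidth_bytes_per_s
    return launch_overhead_s + transfer_s + candidate.gpu_kernel_time_s


def estimated_speedup(candidate: OffloadCandidate,
                      link_bandwidth_bytes_per_s: float,
                      launch_overhead_s: float) -> float:
    """Ratio of CPU time to estimated offload time; above 1.0 favors offload."""
    offload_s = estimated_offload_time(
        candidate, link_bandwidth_bytes_per_s, launch_overhead_s)
    return candidate.cpu_time_s / offload_s


if __name__ == "__main__":
    # Illustrative numbers only: an 8 GB/s link and 100 us launch overhead.
    bandwidth = 8e9
    overhead = 1e-4

    large = OffloadCandidate(cpu_time_s=0.010, gpu_kernel_time_s=0.001,
                             bytes_transferred=8_000_000)
    small = OffloadCandidate(cpu_time_s=0.0002, gpu_kernel_time_s=0.00002,
                             bytes_transferred=800_000)

    for name, region in (("large", large), ("small", small)):
        ratio = estimated_speedup(region, bandwidth, overhead)
        verdict = "offload" if ratio > 1.0 else "keep on CPU"
        print(f"{name}: estimated speedup {ratio:.2f} -> {verdict}")
```

With these made-up inputs, the large region comes out ahead on the GPU, while the small region loses to the fixed launch and transfer costs even though its kernel is ten times faster. Estimates like these need measured inputs (for example kernel data transfer size and task timing), which is the gap that profiling data is meant to fill.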


