A Practical Performance Model for Compute and Memory Bound GPU Kernels

2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing. Pub Date: 2015-03-04. DOI: 10.1109/PDP.2015.51

E. Konstantinidis, Y. Cotronis


Citations: 30

Abstract

Performance prediction of GPU kernels is generally a tedious procedure with unpredictable results. In this paper, we provide a practical model for estimating the performance of CUDA kernels on GPU hardware in an automated manner. First, we propose the quadrant-split model, an alternative to the roofline visual performance model, which provides insight into the performance-limiting factors of a particular kernel across multiple devices with different compute-to-memory-bandwidth ratios. We elaborate on the compute-bound versus memory-bound characteristic of kernels. In addition, we developed a micro-benchmark program that exposes the peak compute and memory transfer performance under variable operational intensity, and we present experimental results from executions on different GPUs. In the proposed performance prediction procedure, a set of kernel features is extracted through an automated profiling run that records a set of significant kernel metrics. Additionally, a small set of device features for the target GPU is generated using micro-benchmarking and architecture specifications. By combining the kernel and device features, we determine the performance-limiting factor and generate an estimate of the kernel's execution time. We performed experiments on DAXPY, DGEMM, FFT and stencil computation kernels using 4 GPUs, and we observed an absolute prediction error of 10.1% in the average case and 25.8% in the worst case.
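The idea of combining kernel features with device features to find the limiting factor can be illustrated with a minimal sketch. It uses the classical roofline-style bound (the larger of compute time and memory-transfer time), not the quadrant-split model itself, whose details are not given in the abstract. The names `Device`, `Kernel`, `estimate` and `daxpy`, and the device figures in the usage note, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical device features, e.g. as measured by a micro-benchmark."""
    peak_gflops: float  # sustained compute throughput, GFLOP/s
    peak_gbytes: float  # sustained memory bandwidth, GB/s


@dataclass(frozen=True)
class Kernel:
    """Hypothetical kernel features, e.g. as recorded by a profiling run."""
    flops: float        # floating-point operations executed
    bytes_moved: float  # bytes transferred to and from device memory


def estimate(kernel: Kernel, device: Device) -> tuple[str, float]:
    """Return (limiting factor, estimated time in seconds).

    Roofline-style assumption: compute and memory traffic overlap
    perfectly, so the slower of the two determines execution time.
    """
    t_compute = kernel.flops / (device.peak_gflops * 1e9)
    t_memory = kernel.bytes_moved / (device.peak_gbytes * 1e9)
    if t_compute >= t_memory:
        return "compute", t_compute
    return "memory", t_memory


def daxpy(n: int) -> Kernel:
    """DAXPY (y = a*x + y) on n doubles: 2 flops and 24 bytes per element
    (read x, read y, write y), assuming no reuse from cache."""
    return Kernel(flops=2.0 * n, bytes_moved=24.0 * n)
```

For example, `estimate(daxpy(10_000_000), Device(peak_gflops=1000.0, peak_gbytes=200.0))` classifies DAXPY as memory bound on this made-up device, because its operational intensity (2/24 flop per byte) is far below the device's compute-to-bandwidth ratio (5 flop per byte). A real predictor along the lines of the paper would replace the peak figures with micro-benchmarked values and the kernel counts with profiled metrics.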


