Exploration of automatic optimization for CUDA programming

2012 2nd IEEE International Conference on Parallel, Distributed and Grid Computing. Pub Date: 2012-12-01. DOI: 10.1109/PDGC.2012.6449791

M. Al-Mouhamed, A. ul Hassan Khan


Citations: 6

Abstract

Graphics processing units (GPUs) are gaining ground in high-performance computing. CUDA (an extension to C) is the most widely used parallel programming framework for general-purpose GPU computation. However, writing an optimized CUDA program is complex even for experts. We present a method for restructuring loops into optimized CUDA kernels based on a three-step algorithm: loop tiling, coalesced memory access, and resource optimization. We also establish the relationships between the influencing parameters and propose a method for finding the possible tiling solutions with coalesced memory access that best meet the identified constraints. We also present a simplified algorithm for restructuring loops and rewriting them as an efficient CUDA kernel. The execution model of the synthesized kernel consists of distributing the kernel threads uniformly to keep all cores busy while transferring a tailored data locality, which is accessed using a coalesced pattern to amortize the long latency of the secondary memory. In the evaluation, we implement some simple applications using the proposed restructuring strategy and evaluate their performance in terms of execution time and GPU throughput.
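To make loop tiling and coalesced memory access concrete, the following is a minimal sketch of a tiled matrix multiplication in CUDA. It is a standard textbook illustration, not the kernel synthesized by the authors' method. The tile width `TILE`, the kernel name `tiledMatMul`, and the helper `maxErrorTiledMatMul` are assumptions introduced here for illustration.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumed tile width (hypothetical, not from the paper): 16 x 16 = 256 threads per block.
constexpr int TILE = 16;

// Computes C = A * B for square, row-major n x n matrices.
__global__ void tiledMatMul(const float* A, const float* B, float* C, int n) {
    // One tile of A and one tile of B are staged in on-chip shared memory.
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // The original k-loop is tiled: each iteration processes TILE values of k.
    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Threads with consecutive threadIdx.x read consecutive global
        // addresses, so both loads are coalesced. Out-of-range cells get 0.
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();  // wait until the whole tile is loaded

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // wait before the tile is overwritten
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical helper: runs the kernel and returns the maximum absolute
// difference from a CPU reference, or -1 if a CUDA call fails.
float maxErrorTiledMatMul(int n) {
    size_t count = size_t(n) * n, bytes = count * sizeof(float);
    std::vector<float> hA(count), hB(count), hC(count);
    for (size_t i = 0; i < count; ++i) {
        hA[i] = float(i % 7) * 0.5f;
        hB[i] = float(i % 5) - 2.0f;
    }
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    if (cudaMalloc(&dA, bytes) != cudaSuccess ||
        cudaMalloc(&dB, bytes) != cudaSuccess ||
        cudaMalloc(&dC, bytes) != cudaSuccess) {
        cudaFree(dA); cudaFree(dB); cudaFree(dC);  // cudaFree(nullptr) is a no-op
        return -1.0f;
    }
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    tiledMatMul<<<grid, block>>>(dA, dB, dC, n);

    // This copy also waits for the kernel and reports any launch or run error.
    cudaError_t err = cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (err != cudaSuccess) return -1.0f;

    float maxErr = 0.0f;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[i * n + k] * hB[k * n + j];
            maxErr = std::fmax(maxErr, std::fabs(ref - hC[i * n + j]));
        }
    return maxErr;
}

int main() {
    // n = 100 is not a multiple of TILE, so the boundary checks are exercised.
    float e = maxErrorTiledMatMul(100);
    std::printf("max abs error: %g\n", e);
    return (e >= 0.0f && e < 1e-3f) ? 0 : 1;
}
```

Compiled with `nvcc`, the program exits with status 0 when the GPU result matches the CPU reference. The choice of `TILE` is where the third step, resource optimization, comes in: a larger tile increases data reuse but consumes more shared memory and threads per block, which limits how many blocks can reside on a multiprocessor at once.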

