Optimizing GPU Register Usage: Extensions to OpenACC and Compiler Optimizations

2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI:10.1109/ICPP.2016.72

Xiaonan Tian, Dounia Khaldi, Deepak Eachempati, Rengan Xu, B. Chapman

Citations: 5

Abstract

Programming accelerator-based systems with compiler directives, through APIs such as OpenACC or OpenMP, has become increasingly popular because of the portability and productivity it offers. However, compared with the performance that lower-level programming interfaces such as CUDA or OpenCL provide, directive-based approaches may incur a significant performance penalty. To support massively parallel computations, accelerators such as GPGPUs offer an expansive set of registers, larger than even the L1 cache, to hold the temporary state of each thread. Scalar variables are the most likely candidates to be assigned to these registers by the compiler. Hence, scalar replacement is a key enabling optimization for improving the utilization of register files on accelerator devices and thereby substantially reducing the cost of memory operations. However, aggressive scalar replacement may require a large number of registers, which limits the use of this technique unless mitigating approaches such as those described in this paper are taken. In this paper, we propose solutions to optimize register usage within offloaded computations that use OpenACC directives. We first present a compiler optimization called SAFARA that extends the classical scalar replacement algorithm to improve register file utilization on GPUs. SAFARA prioritizes the most beneficial data for allocation in registers based on frequency of use and on memory access latency. It also uses a static feedback strategy to retrieve low-level register information that guides the compiler in carrying out the scalar replacement transformation. Moreover, we extend the OpenACC interface with two new clauses, dim and small, that greatly reduce the number of scalars to replace and so eliminate the need for more registers. We evaluate SAFARA and the new clauses using the SPEC and NAS OpenACC benchmarks. Our results suggest that these approaches are effective for improving the overall performance of code executing on GPUs: we obtained speedups of up to 2.5x on the NAS benchmarks and up to 2.08x on the SPEC benchmarks.
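As an illustration of the classical scalar replacement that the abstract builds on, the following minimal sketch shows a stencil loop before and after the transformation. It is not taken from the paper and does not reproduce SAFARA or the proposed dim and small clauses, whose syntax the abstract does not give. The function names, the row-major array layout, and the three-point stencil are assumptions chosen for the example; the pragmas are standard OpenACC and are ignored by compilers without OpenACC support.

```c
/* Baseline: each inner iteration loads a[k-1], a[k] and a[k+1] from memory,
 * so two of the three loads repeat values already read in earlier iterations. */
void stencil_baseline(const float *a, float *b, int n, int m)
{
    #pragma acc parallel loop copyin(a[0:n*m]) copy(b[0:n*m])
    for (int i = 0; i < n; ++i) {
        for (int k = 1; k < m - 1; ++k) {
            b[i*m + k] = a[i*m + k - 1] + a[i*m + k] + a[i*m + k + 1];
        }
    }
}

/* Scalar-replaced (hand-written illustration): the reused array elements are
 * kept in the scalars left/mid/right, which the compiler can place in
 * registers. Only one new element is loaded per inner iteration, at the cost
 * of three live registers per thread. */
void stencil_scalar_replaced(const float *a, float *b, int n, int m)
{
    #pragma acc parallel loop copyin(a[0:n*m]) copy(b[0:n*m])
    for (int i = 0; i < n; ++i) {
        if (m >= 3) {
            float left = a[i*m + 0];
            float mid  = a[i*m + 1];
            /* The rotating scalars carry values between iterations,
             * so the inner loop must run sequentially within a thread. */
            #pragma acc loop seq
            for (int k = 1; k < m - 1; ++k) {
                float right = a[i*m + k + 1];   /* the only load per iteration */
                b[i*m + k] = left + mid + right;
                left = mid;                     /* rotate the register window */
                mid  = right;
            }
        }
    }
}
```

Both functions compute the same result. The second trades memory loads for registers, which is the trade-off the abstract describes: wider stencils or more reused references need more scalars, and an aggressive transformation can exceed the registers available per thread.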
