Accelerating a C++ CFD Code with OpenACC

J. Kraus, Michael Schlottke, A. Adinetz, D. Pleiter
{"title":"Accelerating a C++ CFD Code with OpenACC","authors":"J. Kraus, Michael Schlottke, A. Adinetz, D. Pleiter","doi":"10.1109/WACCPD.2014.11","DOIUrl":null,"url":null,"abstract":"Todays HPC systems are increasingly utilizing accelerators to lower time to solution for their users and reduce power consumption. To utilize the higher performance and energy efficiency of these accelerators, application developers need to rewrite at least parts of their codes. Taking the C++ flow solver ZFS as an example, we show that the directive-based programming model allows one to achieve good performance with reasonable effort, even for mature codes with many lines of code. Using OpenACC directives permitted us to incrementally accelerate ZFS, focusing on the parts of the program that are relevant for the problem at hand. The two new OpenACC 2.0 features, unstructured data regions and atomics, are required for this. OpenACC's interoperability with existing GPU libraries via the host_data use_device construct allowed to use CUDAaware MPI to achieve multi-GPU scalability comparable to the CPU version of ZFS. Like many other codes, the data structures of ZFS have been designed with traditional CPUs and their relatively large private caches in mind. This leads to suboptimal memory access patterns on accelerators, such as GPUs. We show how the texture cache on NVIDIA GPUs can be used to minimize the performance impact of these suboptimal patterns without writing platform specific code. For the kernel most affected by the memory access pattern, we compare the initial array of structures memory layout with a structure of arrays layout.","PeriodicalId":179664,"journal":{"name":"2014 First Workshop on Accelerator Programming using Directives","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"33","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 First Workshop on Accelerator Programming using Directives","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/WACCPD.2014.11","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 33

Abstract

Today's HPC systems increasingly use accelerators to lower time to solution for their users and to reduce power consumption. To exploit the higher performance and energy efficiency of these accelerators, application developers need to rewrite at least parts of their codes. Taking the C++ flow solver ZFS as an example, we show that the directive-based programming model allows one to achieve good performance with reasonable effort, even for mature codes with many lines of code. Using OpenACC directives permitted us to incrementally accelerate ZFS, focusing on the parts of the program that are relevant for the problem at hand. Two new OpenACC 2.0 features, unstructured data regions and atomics, are required for this. OpenACC's interoperability with existing GPU libraries via the host_data use_device construct allowed us to use CUDA-aware MPI to achieve multi-GPU scalability comparable to that of the CPU version of ZFS. Like many other codes, the data structures of ZFS were designed with traditional CPUs and their relatively large private caches in mind. This leads to suboptimal memory access patterns on accelerators such as GPUs. We show how the texture cache on NVIDIA GPUs can be used to minimize the performance impact of these suboptimal patterns without writing platform-specific code. For the kernel most affected by the memory access pattern, we compare the initial array-of-structures memory layout with a structure-of-arrays layout.
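The abstract names the concrete OpenACC mechanisms the port relies on: unstructured data regions to keep device copies of the solver's fields alive beyond any single structured block, atomics for scatter-style updates, host_data use_device to hand device pointers to a CUDA-aware MPI library, and const/restrict-qualified read-only kernel arguments so the compiler can route loads through the GPU's read-only (texture) cache without platform-specific code. The following is a minimal sketch of these idioms only, not code from ZFS; the Solver class and all member, parameter, and buffer names are invented for illustration.

// Hypothetical sketch of the OpenACC 2.0 idioms named in the abstract.
#include <mpi.h>

struct Solver {
  int     n        = 0;        // number of cells
  double* u        = nullptr;  // a conserved variable
  double* residual = nullptr;  // per-cell residual

  explicit Solver(int ncells) : n(ncells) {
    u        = new double[n]();
    residual = new double[n]();
    // Unstructured data region (OpenACC 2.0): the device copies live until
    // the matching exit data, independent of any structured data block.
    #pragma acc enter data copyin(this[0:1], u[0:n], residual[0:n])
  }

  ~Solver() {
    #pragma acc exit data delete(u[0:n], residual[0:n], this[0:1])
    delete[] u;
    delete[] residual;
  }

  // Scatter face fluxes to cells; several faces may update the same cell,
  // hence the OpenACC atomic. Declaring the read-only inputs const and
  // __restrict__ lets the compiler serve their loads from the GPU's
  // read-only (texture) cache without any platform-specific code.
  void accumulate(const int*    __restrict__ cell_of_face,
                  const double* __restrict__ flux,
                  int nfaces) {
    #pragma acc parallel loop present(residual[0:n]) \
        copyin(cell_of_face[0:nfaces], flux[0:nfaces])
    for (int f = 0; f < nfaces; ++f) {
      const int c = cell_of_face[f];
      #pragma acc atomic update
      residual[c] += flux[f];
    }
  }

  // host_data use_device hands the *device* addresses of the halo buffers
  // to a CUDA-aware MPI library, avoiding a staging copy through the host.
  // Assumes both buffers are already present on the device (enter data).
  void exchange_halo(double* sendbuf, double* recvbuf, int count,
                     int neighbor, MPI_Comm comm) {
    #pragma acc host_data use_device(sendbuf, recvbuf)
    {
      MPI_Sendrecv(sendbuf, count, MPI_DOUBLE, neighbor, 0,
                   recvbuf, count, MPI_DOUBLE, neighbor, 0,
                   comm, MPI_STATUS_IGNORE);
    }
  }
};

For the array-of-structures versus structure-of-arrays comparison mentioned at the end of the abstract, the SoA variant would store each variable in its own contiguous array instead of interleaving all variables per cell, so that consecutive GPU threads read consecutive addresses.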