Resource conscious reuse-driven tiling for GPUs

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT). Pub Date: 2016-09-11. DOI: 10.1145/2967938.2967967

P. Rawat, Changwan Hong, Mahesh Ravishankar, Vinod Grover, L. Pouchet, A. Rountev, P. Sadayappan


Citations: 32

Abstract

Computations involving successive application of 3D stencil operators are widely used in many application domains, such as image processing, computational electromagnetics, seismic processing, and climate modeling. Enhancement of temporal and spatial locality via tiling is generally required in order to overcome performance bottlenecks due to limited bandwidth to global memory on GPUs. However, the low shared memory capacity on current GPU architectures makes effective tiling for 3D stencils very challenging - several previous domain-specific compilers for stencils have demonstrated very high performance for 2D stencils, but much lower performance on 3D stencils. In this paper, we develop an effective resource-constraint-driven approach for automated GPU code generation for stencils. We present a fusion technique that judiciously fuses stencil computations to minimize data movement, while controlling computational redundancy and maximizing resource usage. The fusion model subsumes time tiling of iterated stencils, and can be easily adapted to different GPU architectures. We integrate the fusion model into a code generator that makes effective use of scarce shared memory and registers to achieve high performance. The effectiveness of the automated model-driven code generator is demonstrated through experimental results on a number of benchmarks, comparing against various previously developed GPU code generators.
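The shared memory pressure described in the abstract can be illustrated with a rough back-of-the-envelope calculation. The following is a minimal sketch, not the paper's fusion model: it assumes simple overlapped tiling (each fused stencil application widens the halo by the stencil radius), double-precision data, and a 48 KiB per-block shared memory budget. The function names, tile sizes, and the budget are all hypothetical choices made for illustration.

```python
from math import prod

# Assumed per-block shared memory budget in bytes (illustrative, not from the paper).
SHARED_MEM_LIMIT = 48 * 1024


def tile_footprint_bytes(tile, radius, fused_steps, bytes_per_elem=8):
    """Bytes of input needed to produce one output tile after fusing
    `fused_steps` stencil applications, under overlapped tiling.

    Each fused step widens the halo by `radius` on both sides of every
    dimension, so the input region has extent + 2 * radius * fused_steps
    points per dimension.
    """
    halo = radius * fused_steps
    return prod(extent + 2 * halo for extent in tile) * bytes_per_elem


def redundancy_ratio(tile, radius, fused_steps):
    """Points computed by one tile divided by the useful points it produces.

    Intermediate steps must also compute the halo that later steps consume,
    which is the redundant work that fusion has to keep under control.
    """
    useful = fused_steps * prod(tile)
    computed = 0
    for step in range(1, fused_steps + 1):
        # Halo that still has to be valid after this step.
        halo = radius * (fused_steps - step)
        computed += prod(extent + 2 * halo for extent in tile)
    return computed / useful


if __name__ == "__main__":
    for tile in [(32, 32), (32, 32, 32)]:
        size = tile_footprint_bytes(tile, radius=1, fused_steps=2)
        fits = "fits" if size <= SHARED_MEM_LIMIT else "does not fit"
        ratio = redundancy_ratio(tile, radius=1, fused_steps=2)
        print(f"tile {tile}: {size} bytes ({fits}), redundancy {ratio:.3f}")
```

Under these assumptions, a 32x32 tile with two fused steps needs 10,368 bytes, while a 32x32x32 tile needs 373,248 bytes, far beyond the assumed budget, and performs roughly 10% redundant computation. This is only meant to show why 3D tiles quickly exhaust shared memory; the paper's code generator relies on its own resource-constraint-driven model and also uses registers.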
