Locality-Aware Mapping of Nested Parallel Patterns on GPUs

HyoukJoong Lee, Kevin J. Brown, Arvind K. Sujeeth, Tiark Rompf, K. Olukotun
{"title":"gpu上嵌套并行模式的位置感知映射","authors":"HyoukJoong Lee, Kevin J. Brown, Arvind K. Sujeeth, Tiark Rompf, K. Olukotun","doi":"10.1109/MICRO.2014.23","DOIUrl":null,"url":null,"abstract":"Recent work has explored using higher level languages to improve programmer productivity on GPUs. These languages often utilize high level computation patterns (e.g., Map and Reduce) that encode parallel semantics to enable automatic compilation to GPU kernels. However, the problem of efficiently mapping patterns to GPU hardware becomes significantly more difficult when the patterns are nested, which is common in non-trivial applications. To address this issue, we present a general analysis framework for automatically and efficiently mapping nested patterns onto GPUs. The analysis maps nested patterns onto a logical multidimensional domain and parameterizes the block size and degree of parallelism in each dimension. We then add GPU-specific hard and soft constraints to prune the space of possible mappings and select the best mapping. We also perform multiple compiler optimizations that are guided by the mapping to avoid dynamic memory allocations and automatically utilize shared memory within GPU kernels. We compare the performance of our automatically selected mappings to hand-optimized implementations on multiple benchmarks and show that the average performance gap on 7 out of 8 benchmarks is 24%. Furthermore, our mapping strategy outperforms simple 1D mappings and existing 2D mappings by up to 28.6x and 9.6x respectively.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"83 1","pages":"63-74"},"PeriodicalIF":0.0000,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"47","resultStr":"{\"title\":\"Locality-Aware Mapping of Nested Parallel Patterns on GPUs\",\"authors\":\"HyoukJoong Lee, Kevin J. Brown, Arvind K. Sujeeth, Tiark Rompf, K. Olukotun\",\"doi\":\"10.1109/MICRO.2014.23\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent work has explored using higher level languages to improve programmer productivity on GPUs. These languages often utilize high level computation patterns (e.g., Map and Reduce) that encode parallel semantics to enable automatic compilation to GPU kernels. However, the problem of efficiently mapping patterns to GPU hardware becomes significantly more difficult when the patterns are nested, which is common in non-trivial applications. To address this issue, we present a general analysis framework for automatically and efficiently mapping nested patterns onto GPUs. The analysis maps nested patterns onto a logical multidimensional domain and parameterizes the block size and degree of parallelism in each dimension. We then add GPU-specific hard and soft constraints to prune the space of possible mappings and select the best mapping. We also perform multiple compiler optimizations that are guided by the mapping to avoid dynamic memory allocations and automatically utilize shared memory within GPU kernels. We compare the performance of our automatically selected mappings to hand-optimized implementations on multiple benchmarks and show that the average performance gap on 7 out of 8 benchmarks is 24%. 
Furthermore, our mapping strategy outperforms simple 1D mappings and existing 2D mappings by up to 28.6x and 9.6x respectively.\",\"PeriodicalId\":6591,\"journal\":{\"name\":\"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture\",\"volume\":\"83 1\",\"pages\":\"63-74\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-12-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"47\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/MICRO.2014.23\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MICRO.2014.23","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 47

Abstract

Recent work has explored using higher level languages to improve programmer productivity on GPUs. These languages often utilize high level computation patterns (e.g., Map and Reduce) that encode parallel semantics to enable automatic compilation to GPU kernels. However, the problem of efficiently mapping patterns to GPU hardware becomes significantly more difficult when the patterns are nested, which is common in non-trivial applications. To address this issue, we present a general analysis framework for automatically and efficiently mapping nested patterns onto GPUs. The analysis maps nested patterns onto a logical multidimensional domain and parameterizes the block size and degree of parallelism in each dimension. We then add GPU-specific hard and soft constraints to prune the space of possible mappings and select the best mapping. We also perform multiple compiler optimizations that are guided by the mapping to avoid dynamic memory allocations and automatically utilize shared memory within GPU kernels. We compare the performance of our automatically selected mappings to hand-optimized implementations on multiple benchmarks and show that the average performance gap on 7 out of 8 benchmarks is 24%. Furthermore, our mapping strategy outperforms simple 1D mappings and existing 2D mappings by up to 28.6x and 9.6x respectively.
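The mapping decision the analysis automates can be illustrated with a small CUDA sketch (not from the paper; the row-sum workload and the kernel names rowSums1D and rowSums2D are illustrative assumptions). A nested Map-over-rows / Reduce-over-columns pattern admits at least two mappings: a simple 1D mapping that assigns one thread per row and runs the inner Reduce sequentially, and a 2D-style mapping that assigns one thread block per row and parallelizes the inner Reduce through shared memory, roughly the kind of mapping-guided shared-memory utilization the abstract describes.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Nested pattern: Map over rows, each containing a Reduce over columns
// (per-row sums of a rows x cols matrix, stored row-major).

// Simple 1D mapping: one thread per row, inner Reduce executed
// sequentially. Threads in the same warp read addresses a full row
// apart, so global loads are uncoalesced.
__global__ void rowSums1D(const float* in, float* out, int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows) {
        float sum = 0.0f;
        for (int c = 0; c < cols; ++c)
            sum += in[(size_t)row * cols + c];
        out[row] = sum;
    }
}

// 2D-style mapping: one thread block per row, the block's threads
// cooperating on the inner Reduce through shared memory. Consecutive
// threads read consecutive columns, so loads coalesce.
__global__ void rowSums2D(const float* in, float* out, int rows, int cols) {
    extern __shared__ float partial[];
    int row = blockIdx.x;
    float sum = 0.0f;
    for (int c = threadIdx.x; c < cols; c += blockDim.x)
        sum += in[(size_t)row * cols + c];
    partial[threadIdx.x] = sum;
    __syncthreads();
    // Tree reduction over the block's partial sums (blockDim.x must be
    // a power of two for this simple variant).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[row] = partial[0];
}

int main() {
    const int rows = 1024, cols = 4096, threads = 256;
    size_t bytes = (size_t)rows * cols * sizeof(float);

    float* hIn = (float*)malloc(bytes);
    for (size_t i = 0; i < (size_t)rows * cols; ++i) hIn[i] = 1.0f;

    float *dIn, *dOut;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, rows * sizeof(float));
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);

    // Same nested pattern, two different mappings.
    rowSums1D<<<(rows + threads - 1) / threads, threads>>>(dIn, dOut, rows, cols);
    rowSums2D<<<rows, threads, threads * sizeof(float)>>>(dIn, dOut, rows, cols);
    cudaDeviceSynchronize();

    float h0;
    cudaMemcpy(&h0, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    printf("row 0 sum = %.0f (expected %d)\n", h0, cols);

    cudaFree(dIn); cudaFree(dOut); free(hIn);
    return 0;
}

On row-major data the 2D mapping typically wins because its loads coalesce, while for very short rows the 1D mapping's lower synchronization overhead can prevail; weighing such locality and utilization trade-offs per dimension is exactly the search space the paper's hard and soft constraints prune.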