A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction

GPGPU-3, Pub Date: 2010-03-14, DOI: 10.1145/1735688.1735698

Allen Leung, Nicolas Vasilache, Benoît Meister, M. Baskaran, David Wohlford, C. Bastoul, R. Lethin


Citations: 87

Abstract

Programmers for GPGPU face a rapidly changing substrate of programming abstractions, execution models, and hardware implementations. Numerous demonstrations, each for a particular combination of application kernel, programming language, and GPU hardware instance, have established that significant improvements in price/performance and energy/performance over general-purpose processors are possible. But each of these demonstrations is the result of significant dedicated programmer labor, which is likely to be duplicated for each new GPU hardware architecture to achieve performance portability.

This paper discusses the implementation, in the R-Stream compiler, of a source-to-source mapping pathway from a high-level, textbook-style algorithm expression method in ANSI C to multi-GPGPU accelerated computers. The compiler performs hierarchical decomposition and parallelization of the algorithm between and across the host and multiple GPGPUs, and within each GPU. The semantic transformations are expressed within the polyhedral model, including optimization of integrated parallelization, locality, and contiguity tradeoffs. Hierarchical tiling is performed. Communication and synchronization operations at multiple levels are generated automatically. The resulting mapping is currently emitted in the CUDA programming language.

The GPU backend adds to the range of hardware and accelerator targets for R-Stream and indicates the potential for performance portability of single sources across multiple hardware targets.
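To make "textbook-style ANSI C" and "hierarchical tiling" concrete, the following is a minimal sketch and not R-Stream output. It shows a plain matrix multiply next to a hand-written two-level tiled version of the same loop nest. The problem size `N`, the tile sizes `TILE_GPU` and `TILE_BLK`, the function names, and the reading of the outer tile as one GPU's share and the inner tile as one thread block's share are all assumptions made for illustration. The sketch runs on the host only and omits the communication and synchronization code that the abstract says the compiler generates.

```c
#include <stddef.h>

#define N        64  /* problem size (assumed for illustration) */
#define TILE_GPU 32  /* outer tile edge: hypothetical per-GPU share */
#define TILE_BLK 8   /* inner tile edge: hypothetical per-thread-block share */

/* Textbook-style matrix multiply: the kind of plain loop nest
 * a source-to-source mapper would take as input. */
static void matmul_reference(const float *a, const float *b, float *c)
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += a[i * N + k] * b[k * N + j];
            c[i * N + j] = sum;
        }
    }
}

/* The same computation with the (i, j) iteration space tiled at two
 * levels. Every (i, j) point is visited exactly once, so the result
 * is unchanged; only the traversal order differs. */
static void matmul_tiled(const float *a, const float *b, float *c)
{
    /* Level 1: coarse tiles (assumed to map to separate GPUs). */
    for (int ig = 0; ig < N; ig += TILE_GPU)
    for (int jg = 0; jg < N; jg += TILE_GPU)
        /* Level 2: fine tiles inside each coarse tile
         * (assumed to map to thread blocks). */
        for (int ib = ig; ib < ig + TILE_GPU; ib += TILE_BLK)
        for (int jb = jg; jb < jg + TILE_GPU; jb += TILE_BLK)
            /* Points inside one fine tile (assumed to map to threads). */
            for (int i = ib; i < ib + TILE_BLK; i++)
            for (int j = jb; j < jb + TILE_BLK; j++) {
                float sum = 0.0f;
                for (int k = 0; k < N; k++)
                    sum += a[i * N + k] * b[k * N + j];
                c[i * N + j] = sum;
            }
}

/* Returns 1 if the tiled version reproduces the reference exactly. */
int tiled_matches_reference(void)
{
    static float a[N * N], b[N * N], c_ref[N * N], c_tiled[N * N];

    /* Small integer-valued inputs keep the float arithmetic exact. */
    for (int i = 0; i < N * N; i++) {
        a[i] = (float)(i % 7);
        b[i] = (float)(i % 5);
    }
    matmul_reference(a, b, c_ref);
    matmul_tiled(a, b, c_tiled);

    for (int i = 0; i < N * N; i++)
        if (c_ref[i] != c_tiled[i])
            return 0;
    return 1;
}
```

Calling `tiled_matches_reference()` from a `main` function checks that the tiling preserves the result. The sketch assumes `N` is a multiple of `TILE_GPU` and `TILE_GPU` is a multiple of `TILE_BLK`; other sizes would need boundary handling in the tile loops.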



