Automatic Parallelization of GPU Applications Using OpenCL

2015 Asia-Pacific Conference on Computer Aided System Engineering. Pub Date: 2015-07-14. DOI: 10.1109/APCASE.2015.56

L. Solano-Quinde, Brett M. Bode, Arun Kumar Somani


Citations: 1

Abstract



Graphics Processing Units (GPUs) have been successfully used to accelerate scientific applications because of their computational power and the availability of programming languages that make writing scientific applications for GPUs more approachable. However, since the GPU programming model requires offloading all the data to GPU memory, the memory footprint of an application is limited to the size of the GPU memory. Multi-GPU systems can make memory-limited problems tractable by parallelizing the computation and data among the available GPUs. Applications written for single-GPU systems can be parallelized (i) at runtime, through an environment that captures the memory operations and kernel calls and distributes them among the available GPUs, or (ii) at compile time, through a pre-compiler that transforms the application to decompose the data and computation among the available GPUs. In this paper we propose a framework and implement a tool that transforms an OpenCL application written for single-GPU systems into one that runs on multi-GPU systems. Based on data dependency and data usage analysis, the application is transformed to decompose data and computation among the available GPUs. To reduce the data transfer overhead, computation-communication overlapping techniques are used. We tested our tool using two applications with different data transfer requirements. For the application with no data transfer requirements, a linear speedup is achieved, while for the application with data transfers, computation-communication overlapping reduces the communication overhead by 40%.
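The abstract does not give the tool's decomposition scheme, so the following is only a minimal sketch of one common way to split a one-dimensional index space among devices. It assumes contiguous, near-equal chunks and a kernel with no dependencies between chunks, which corresponds to the "no data transfer requirements" case. The names `decompose`, `vector_add_kernel`, and `run_multi_device` are hypothetical, and the devices are simulated in plain Python rather than driven through OpenCL.

```python
from typing import List, Tuple


def decompose(global_size: int, num_devices: int) -> List[Tuple[int, int]]:
    """Split [0, global_size) into contiguous (offset, size) chunks, one per device.

    Assumption: a simple block decomposition; the paper's tool may choose
    chunks differently based on its dependency and usage analysis.
    """
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")
    base, extra = divmod(global_size, num_devices)
    chunks = []
    offset = 0
    for dev in range(num_devices):
        # The first `extra` devices take one additional element each.
        size = base + (1 if dev < extra else 0)
        chunks.append((offset, size))
        offset += size
    return chunks


def vector_add_kernel(a: List[int], b: List[int], offset: int, size: int) -> List[int]:
    """Stand-in for a kernel launched over `size` work-items starting at `offset`."""
    return [a[i] + b[i] for i in range(offset, offset + size)]


def run_multi_device(a: List[int], b: List[int], num_devices: int) -> List[int]:
    """Run the kernel once per simulated device and concatenate the partial results."""
    out: List[int] = []
    for offset, size in decompose(len(a), num_devices):
        out.extend(vector_add_kernel(a, b, offset, size))
    return out


if __name__ == "__main__":
    data = list(range(10))
    print(decompose(len(data), 3))
    print(run_multi_device(data, data, 3))
```

Because each chunk touches only its own slice of the inputs, the partial results can be concatenated without any exchange between devices. A kernel that reads neighboring elements would additionally need boundary data transferred between devices, which is the case where the overlapping of computation and communication mentioned in the abstract becomes relevant.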
