Automatic optimization of thread-coarsening for graphics processors

2014 23rd International Conference on Parallel Architecture and Compilation (PACT) · Pub Date: 2014-08-24 · DOI: 10.1145/2628071.2628087

A. Magni, Christophe Dubach, M. O’Boyle


Citations: 71

Abstract

OpenCL has been designed to achieve functional portability across multi-core devices from different vendors. However, the lack of a single cross-target optimizing compiler severely limits the performance portability of OpenCL programs. Programmers need to manually tune applications for each specific device, which prevents effective portability. We target a compiler transformation specific to data-parallel languages, thread coarsening, and show that it can improve performance across different GPU devices. We then address the problem of selecting the best value for the coarsening factor parameter, i.e., deciding how many threads to merge together. We show experimentally that this is a hard problem to solve: good configurations are difficult to find, and naive coarsening in fact leads to substantial slowdowns. We propose a solution based on a machine-learning model that predicts the best coarsening factor using static features of the kernel function. The model automatically specializes to the different architectures considered. We evaluate our approach on 17 benchmarks on four devices: two Nvidia GPUs and two different generations of AMD GPUs. Using our technique, we achieve average speedups between 1.11× and 1.33×.
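To make "deciding how many threads to merge together" concrete, the following is a minimal sketch that emulates the transformation on the CPU in plain Python. It is not the authors' compiler pass: the names `launch`, `saxpy`, `coarsen`, and `run_saxpy` are hypothetical, the work-items run sequentially, and the contiguous index mapping (`gid * factor + k`) is just one illustrative choice that a real implementation may not use.

```python
def launch(kernel, global_size, *args):
    """Run `kernel` once per work-item id (a sequential stand-in for a GPU launch)."""
    for gid in range(global_size):
        kernel(gid, *args)


def saxpy(gid, a, x, y, out):
    """Original kernel: each work-item computes exactly one output element."""
    out[gid] = a * x[gid] + y[gid]


def coarsen(kernel, factor):
    """Return a kernel in which one work-item does the work of `factor` original ones.

    Assumption: a contiguous mapping, where coarsened work-item `gid` covers the
    original ids gid*factor .. gid*factor + factor - 1.
    """
    def coarsened(gid, *args):
        for k in range(factor):
            kernel(gid * factor + k, *args)
    return coarsened


def run_saxpy(n, factor):
    """Compute 2*x + 1 over n elements using n // factor coarsened work-items."""
    if n % factor != 0:
        raise ValueError("this sketch assumes n is divisible by the coarsening factor")
    x = [float(i) for i in range(n)]
    y = [1.0] * n
    out = [0.0] * n
    # The launch shrinks by `factor`, since each work-item now does `factor` units of work.
    launch(coarsen(saxpy, factor), n // factor, 2.0, x, y, out)
    return out


if __name__ == "__main__":
    # Coarsening must not change the result, only how work is split across work-items.
    print(run_saxpy(16, 1) == run_saxpy(16, 4))
```

The emulation only checks functional equivalence across coarsening factors. It says nothing about performance, which is the difficult part the abstract describes: the best factor differs per kernel and per device, and a poor choice can slow the program down.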



