Dynamically Balanced Synchronization-Avoiding LU Factorization with Multicore and GPUs

Simplice Donfack, S. Tomov, J. Dongarra
{"title":"Dynamically Balanced Synchronization-Avoiding LU Factorization with Multicore and GPUs","authors":"Simplice Donfack, S. Tomov, J. Dongarra","doi":"10.1109/IPDPSW.2014.109","DOIUrl":null,"url":null,"abstract":"Graphics processing units (GPUs) brought huge performance improvements in the scientific and numerical fields. We present an efficient hybrid CPU/GPU approach that is portable, dynamically and efficiently balances the workload between the CPUs and the GPUs, and avoidsdata transfer bottlenecks that are frequently present in numerical algorithms. Our approach determines the amount of initial work to assign to the CPUs before the execution, and then dynamically balances workloads during the execution. Then, we present a theoretical model to guide the choice of the initial amount of work for the CPUs. The validation of our model allows our approach to self-adapt on any architecture using the manufacturer's characteristics of the underlying machine. We illustrate our method for the LU factorization. For this case, we show that the use of our approach combined with a communication avoiding LU algorithm is efficient. For example, our experiments on a 24 cores AMD opteron 6172 show that by adding one GPU (Tesla S2050) we accelerate LU up to 2.4× compared to the corresponding routine in MKL using 24 cores. The comparisons with MAGMA also show significant improvements.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"65 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW.2014.109","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 14

Abstract

Graphics processing units (GPUs) have brought huge performance improvements to scientific and numerical computing. We present an efficient hybrid CPU/GPU approach that is portable, dynamically and efficiently balances the workload between the CPUs and the GPUs, and avoids the data-transfer bottlenecks that are frequently present in numerical algorithms. Our approach determines the amount of initial work to assign to the CPUs before execution, and then dynamically balances the workload during execution. We also present a theoretical model to guide the choice of the initial amount of work for the CPUs. The validation of our model allows our approach to self-adapt to any architecture using the manufacturer's characteristics of the underlying machine. We illustrate our method on LU factorization. For this case, we show that combining our approach with a communication-avoiding LU algorithm is efficient. For example, our experiments on a 24-core AMD Opteron 6172 show that by adding one GPU (Tesla S2050) we accelerate LU by up to 2.4× compared to the corresponding routine in MKL using 24 cores. Comparisons with MAGMA also show significant improvements.
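The abstract describes two scheduling ingredients: a static initial split of work between the CPUs and the GPU, guided by a model built from the manufacturer's peak rates, followed by dynamic rebalancing during the factorization. The sketch below is not the paper's implementation; it is a minimal illustration of that idea under assumed names and a deliberately simple linear model (initial_cpu_fraction, rebalance, and the damping factor are all hypothetical).

# Illustrative sketch (not the authors' code): pick an initial CPU share of the
# trailing-matrix work from peak-performance ratios, then adjust it from
# measured per-step timings. All names and the model are assumptions.

def initial_cpu_fraction(cpu_peak_gflops, gpu_peak_gflops):
    """Static split guided by manufacturer peak rates (hypothetical model)."""
    return cpu_peak_gflops / (cpu_peak_gflops + gpu_peak_gflops)

def rebalance(cpu_fraction, t_cpu, t_gpu, damping=0.5):
    """Shift work toward the device that finished its share earlier.

    If the CPU part of the last step took t_cpu seconds and the GPU part took
    t_gpu seconds, the fraction f* that equalizes finish times satisfies
    f*/rate_cpu = (1 - f*)/rate_gpu; we move part of the way toward it.
    """
    rate_cpu = cpu_fraction / t_cpu          # work processed per second on the CPUs
    rate_gpu = (1.0 - cpu_fraction) / t_gpu  # work processed per second on the GPU
    target = rate_cpu / (rate_cpu + rate_gpu)
    return cpu_fraction + damping * (target - cpu_fraction)

if __name__ == "__main__":
    # Example with made-up peak rates for a multicore CPU and one GPU.
    f = initial_cpu_fraction(cpu_peak_gflops=200.0, gpu_peak_gflops=500.0)
    print(f"initial CPU fraction: {f:.2f}")
    # Suppose the GPU finished its share faster than the CPUs on one step:
    f = rebalance(f, t_cpu=1.2, t_gpu=0.8)
    print(f"rebalanced CPU fraction: {f:.2f}")

In this toy model the initial fraction depends only on published peak rates, which is what lets the scheme self-adapt to a new machine without tuning runs; the per-step correction then absorbs whatever the static model got wrong.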