Multifrontal Factorization of Sparse SPD Matrices on GPUs

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI:10.1109/IPDPS.2011.44

Thomas George, Vaibhav Saxena, Anshul Gupta, Amik Singh, Anamitra R. Choudhury

{"title":"Multifrontal Factorization of Sparse SPD Matrices on GPUs","authors":"Thomas George, Vaibhav Saxena, Anshul Gupta, Amik Singh, Anamitra R. Choudhury","doi":"10.1109/IPDPS.2011.44","DOIUrl":null,"url":null,"abstract":"Solving large sparse linear systems is often the most computationally intensive component of many scientific computing applications. In the past, sparse multifrontal direct factorization has been shown to scale to thousands of processors on dedicated supercomputers resulting in a substantial reduction in computational time. In recent years, an alternative computing paradigm based on GPUs has gained prominence, primarily due to its affordability, power-efficiency, and the potential to achieve significant speedup relative to desktop performance on regular and structured parallel applications. However, sparse matrix factorization on GPUs has not been explored sufficiently due to the complexity involved in an efficient implementation and concerns of low GPU utilization. In this paper, we present an adaptive hybrid approach for accelerating sparse multifrontal factorization based on a judicious exploitation of the processing power of the host CPU and GPU. We present four different policies for distributing and scheduling the workload between the host CPU and the GPU, and propose a mechanism for a runtime selection of the appropriate policy for each step of sparse Cholesky factorization. This mechanism relies on auto-tuning based on modeling the best policy predictor as a parametric classifier. We estimate the classifier parameters from the available empirical computation time data such that the expected computation time is minimized. This approach is readily adaptable for using the current or an extended set of policies for different CPU-GPU combinations as well as for different combinations of dense kernels for both the CPU and the GPU.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"31 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"42","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 IEEE International Parallel & Distributed Processing Symposium","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS.2011.44","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 42

Abstract

Solving large sparse linear systems is often the most computationally intensive component of many scientific computing applications. In the past, sparse multifrontal direct factorization has been shown to scale to thousands of processors on dedicated supercomputers resulting in a substantial reduction in computational time. In recent years, an alternative computing paradigm based on GPUs has gained prominence, primarily due to its affordability, power-efficiency, and the potential to achieve significant speedup relative to desktop performance on regular and structured parallel applications. However, sparse matrix factorization on GPUs has not been explored sufficiently due to the complexity involved in an efficient implementation and concerns of low GPU utilization. In this paper, we present an adaptive hybrid approach for accelerating sparse multifrontal factorization based on a judicious exploitation of the processing power of the host CPU and GPU. We present four different policies for distributing and scheduling the workload between the host CPU and the GPU, and propose a mechanism for a runtime selection of the appropriate policy for each step of sparse Cholesky factorization. This mechanism relies on auto-tuning based on modeling the best policy predictor as a parametric classifier. We estimate the classifier parameters from the available empirical computation time data such that the expected computation time is minimized. This approach is readily adaptable for using the current or an extended set of policies for different CPU-GPU combinations as well as for different combinations of dense kernels for both the CPU and the GPU.

查看原文本刊更多论文

gpu上稀疏SPD矩阵的多额分解

求解大型稀疏线性系统通常是许多科学计算应用中计算量最大的部分。在过去，稀疏多额直接分解已经被证明可以扩展到专用超级计算机上的数千个处理器，从而大大减少了计算时间。近年来，一种基于gpu的替代计算范式获得了突出的地位，这主要是由于它的可负担性、功率效率，以及在常规和结构化并行应用程序上相对于桌面性能实现显著加速的潜力。然而，由于高效实现的复杂性和对GPU利用率低的担忧，GPU上的稀疏矩阵分解尚未得到充分的探索。在本文中，我们提出了一种基于明智地利用主机CPU和GPU的处理能力来加速稀疏多额分解的自适应混合方法。我们提出了在主机CPU和GPU之间分配和调度工作负载的四种不同策略，并提出了一种针对稀疏Cholesky分解的每个步骤选择适当策略的运行时机制。该机制依赖于基于将最佳策略预测器建模为参数分类器的自动调优。我们从可用的经验计算时间数据中估计分类器参数，使期望的计算时间最小化。这种方法很容易适用于为不同的CPU-GPU组合使用当前或扩展的策略集，以及为CPU和GPU使用不同的密集内核组合。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2011 IEEE International Parallel & Distributed Processing Symposium

自引率

0.00%

发文量