GPU-based multifrontal optimizing method in sparse Cholesky factorization

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP). Pub Date: 2015-07-27. DOI: 10.1109/ASAP.2015.7245714

Ran Zheng, Wei Wang, Hai Jin, Song Wu, Yong Chen, Han Jiang

Pages: 90-97

Citations: 6

Abstract

In many scientific computing applications, sparse Cholesky factorization is used to solve large sparse systems of linear equations in distributed environments. GPU computing offers a new way to solve this problem. However, sparse Cholesky factorization on a GPU rarely achieves good performance, because of the structural irregularity of the matrix and low GPU resource utilization. A hybrid CPU-GPU implementation of sparse Cholesky factorization is proposed, based on the multifrontal method. In this method, a large sparse coefficient matrix is decomposed into a series of small dense matrices (frontal matrices), and multiple GEMM (General Matrix-Matrix Multiplication) operations are then computed on them. GEMMs are the main operations in sparse Cholesky factorization, but they are hard to run efficiently in parallel on a GPU. To improve performance, a scheme of multiple task queues is adopted when performing the multiple GEMMs parallelized by the multifrontal method: all GEMM tasks are scheduled dynamically on the GPU and CPU according to their computation scale, to balance load and reduce computing time. Experimental results show that the approach outperforms the BLAS and cuBLAS implementations, achieving up to 3.15× and 1.98× speedup, respectively.
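The abstract does not spell out the scheduling policy, so the following is only a minimal sketch of what size-based dispatch of GEMM tasks to per-device queues could look like. It uses a greedy earliest-finish heuristic, which is a stand-in and not necessarily the authors' scheme. The names `GemmTask` and `schedule`, and the parameters `gpu_rate`, `cpu_rate`, and `gpu_overhead`, are hypothetical and chosen purely for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class GemmTask:
    """One C(m x n) += A(m x k) @ B(k x n) update arising from a frontal matrix."""
    m: int
    n: int
    k: int

    @property
    def flops(self) -> int:
        # Standard operation count for a dense matrix-matrix multiply.
        return 2 * self.m * self.n * self.k


def schedule(tasks, gpu_rate=10.0, cpu_rate=1.0, gpu_overhead=5000.0):
    """Place each task on the device queue that would finish it earliest.

    gpu_rate and cpu_rate are assumed throughputs (flops per time unit);
    gpu_overhead is an assumed fixed launch-and-transfer cost per GPU task.
    All three are illustrative values, not measurements.
    """
    queues = {"gpu": deque(), "cpu": deque()}
    busy_until = {"gpu": 0.0, "cpu": 0.0}
    # Place the largest tasks first so that small ones fill remaining gaps.
    for task in sorted(tasks, key=lambda t: t.flops, reverse=True):
        finish = {
            "gpu": busy_until["gpu"] + gpu_overhead + task.flops / gpu_rate,
            "cpu": busy_until["cpu"] + task.flops / cpu_rate,
        }
        device = min(finish, key=finish.get)
        queues[device].append(task)
        busy_until[device] = finish[device]
    return queues, busy_until


if __name__ == "__main__":
    work = [GemmTask(4, 4, 4), GemmTask(100, 100, 100), GemmTask(8, 8, 8)]
    queues, busy = schedule(work)
    for device, queue in queues.items():
        print(device, [(t.m, t.n, t.k) for t in queue], busy[device])
```

Under these assumed costs, a large GEMM amortizes the fixed GPU overhead and lands in the GPU queue, while small GEMMs stay on the CPU, which is one plausible reading of scheduling "according to computation scale".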
