A Fast Batched Cholesky Factorization on a GPU

2014 43rd International Conference on Parallel Processing. Pub Date: 2014-11-20. DOI: 10.1109/ICPP.2014.52

Tingxing Dong, A. Haidar, S. Tomov, J. Dongarra


Citations: 36

Abstract

Currently, state-of-the-art libraries such as MAGMA focus on very large linear algebra problems, while the solution of many small independent problems, usually referred to as batched problems, has not received adequate attention. In this paper, we propose a batched Cholesky factorization on a GPU. Three algorithms (non-blocked, blocked, and recursive blocked) are examined. In the recursive blocked algorithm, the left-looking version of the Cholesky factorization is used to factorize the panel, and the right-looking version is used to update the trailing matrix. Our batched Cholesky achieves up to a 1.8× speedup over the optimized parallel implementation in the MKL library on two sockets of Intel Sandy Bridge CPUs. Further, we use the new routines to develop a single Cholesky factorization solver that targets large matrix sizes. Our approach differs from MAGMA in that it is implemented entirely on the GPU: both the panel factorization and the trailing matrix updates run on the GPU, so the implementation does not depend on the speed of the CPU. Our full GPU solution achieves 85% of the performance of hybrid MAGMA, which uses 16 Sandy Bridge cores in addition to an NVIDIA K40 GPU. Moreover, we achieve 80% of the practical dgemm peak of the machine, while MAGMA achieves only 75%. Finally, in terms of energy consumption, we outperform MAGMA by 1.5× in performance-per-watt for large matrices.
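To make the panel/trailing-matrix split concrete, here is a minimal CPU-side sketch in pure Python. It is not the authors' GPU implementation: the function names, the list-of-lists matrix layout, and the block size `nb` are all assumptions made for illustration, and it shows only the plain blocked variant, not the recursive one. Each block column (panel) is factored with a left-looking sweep, and the finished panel is then applied to the trailing matrix in right-looking fashion. The batch is handled by a simple loop over independent matrices.

```python
import math


def factor_panel_left_looking(A, j0, j1):
    """Factor panel columns j0..j1-1 of A in place (lower triangle only)."""
    n = len(A)
    for j in range(j0, j1):
        # Left-looking step: pull the contributions of the already
        # factored panel columns j0..j-1 into column j.
        for i in range(j, n):
            s = 0.0
            for k in range(j0, j):
                s += A[i][k] * A[j][k]
            A[i][j] -= s
        d = A[j][j]
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        d = math.sqrt(d)
        A[j][j] = d
        # Scale the rest of the column by the diagonal entry.
        for i in range(j + 1, n):
            A[i][j] /= d


def update_trailing_right_looking(A, j0, j1):
    """Right-looking step: subtract the panel's outer product from the
    lower triangle of the trailing matrix A[j1:, j1:]."""
    n = len(A)
    for i in range(j1, n):
        for j in range(j1, i + 1):
            s = 0.0
            for k in range(j0, j1):
                s += A[i][k] * A[j][k]
            A[i][j] -= s


def blocked_cholesky(A, nb=2):
    """Return the lower Cholesky factor of A; the input is not modified."""
    L = [row[:] for row in A]
    n = len(L)
    for j0 in range(0, n, nb):
        j1 = min(j0 + nb, n)
        factor_panel_left_looking(L, j0, j1)
        update_trailing_right_looking(L, j0, j1)
    # Clear the strictly upper triangle, which was never referenced.
    for i in range(n):
        for j in range(i + 1, n):
            L[i][j] = 0.0
    return L


def batched_cholesky(batch, nb=2):
    """Factor every matrix in the batch independently."""
    return [blocked_cholesky(A, nb) for A in batch]
```

Calling `batched_cholesky([A1, A2], nb=2)` returns one lower-triangular factor per input matrix. In a GPU setting the outer loop over the batch would presumably be mapped to parallel thread blocks rather than executed sequentially, but that mapping is beyond what this sketch attempts to show.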

