A Fast Batched Cholesky Factorization on a GPU

Tingxing Dong, A. Haidar, S. Tomov, J. Dongarra
Published in: 2014 43rd International Conference on Parallel Processing, 2014-11-20
DOI: 10.1109/ICPP.2014.52 (https://doi.org/10.1109/ICPP.2014.52)
Citations: 36

Abstract

Currently, state-of-the-art libraries such as MAGMA focus on very large linear algebra problems, while solving many small independent problems, usually referred to as batched problems, is not given adequate attention. In this paper, we propose a batched Cholesky factorization on a GPU. Three algorithms (non-blocked, blocked, and recursive blocked) were examined. In the recursive blocked algorithm, the left-looking version of the Cholesky factorization is used to factorize the panel, and the right-looking version is used to update the trailing matrix. Our batched Cholesky achieves up to 1.8× speedup compared to the optimized parallel implementation in the MKL library on two sockets of Intel Sandy Bridge CPUs. Further, we use the new routines to develop a single Cholesky factorization solver that targets large matrix sizes. Our approach differs from MAGMA in being implemented entirely on the GPU: both the panel factorization and the trailing matrix updates run on the GPU, so the implementation does not depend on the speed of the CPU. Our full GPU solution achieves 85% of the performance of hybrid MAGMA, which uses 16 Sandy Bridge cores in addition to an NVIDIA K40 GPU. Moreover, we achieve 80% of the practical dgemm peak of the machine, while MAGMA achieves only 75%. Finally, in terms of energy consumption, we outperform MAGMA by 1.5× in performance-per-watt for large matrices.
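The split between a left-looking panel factorization and a right-looking trailing update can be illustrated with a small CPU sketch. This is not the authors' GPU implementation: it is a plain blocked (non-recursive) variant in pure Python, the function names and the block size `nb` are hypothetical, and the batch is processed by a sequential loop, whereas a GPU version would handle the independent matrices in parallel.

```python
import math


def panel_left_looking(a, start, end):
    """Factor the block column a[start:, start:end] in place, column by column."""
    n = len(a)
    for j in range(start, end):
        # Left-looking: before finishing column j, apply the contributions of
        # the columns of this panel that are already factored (start..j-1).
        for i in range(j, n):
            a[i][j] -= sum(a[i][p] * a[j][p] for p in range(start, j))
        if a[j][j] <= 0.0:
            raise ValueError("matrix is not positive definite")
        d = math.sqrt(a[j][j])
        a[j][j] = d
        for i in range(j + 1, n):
            a[i][j] /= d


def trailing_right_looking(a, start, end):
    """Update the lower triangle of a[end:, end:] with the finished panel."""
    n = len(a)
    for j in range(end, n):
        for i in range(j, n):
            # Right-looking: push the panel's contribution into the trailing
            # matrix immediately, so later panels see fully updated data.
            a[i][j] -= sum(a[i][p] * a[j][p] for p in range(start, end))


def blocked_cholesky(matrix, nb):
    """Return the lower-triangular factor L with matrix = L * L^T."""
    n = len(matrix)
    a = [[float(x) for x in row] for row in matrix]  # work on a copy
    for start in range(0, n, nb):
        end = min(start + nb, n)
        panel_left_looking(a, start, end)
        trailing_right_looking(a, start, end)
    # Only the lower triangle was computed; zero out the rest.
    return [[a[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]


def batched_cholesky(batch, nb=2):
    """Factor many small independent matrices (sequentially in this sketch)."""
    return [blocked_cholesky(m, nb) for m in batch]


batch = [
    [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]],
    [[25.0, 15.0], [15.0, 18.0]],
]
factors = batched_cholesky(batch, nb=2)
```

Because every matrix in the batch is factored independently, the outer loop in `batched_cholesky` carries no dependencies between iterations, which is what makes the batched problem a natural fit for a GPU.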