High-performance Cholesky factorization for GPU-only execution

A. Haidar, A. Abdelfattah, S. Tomov, J. Dongarra
{"title":"High-performance Cholesky factorization for GPU-only execution","authors":"A. Haidar, A. Abdelfattah, S. Tomov, J. Dongarra","doi":"10.1145/3038228.3038237","DOIUrl":null,"url":null,"abstract":"We present our performance analysis, algorithm designs, and the optimizations needed for the development of high-performance GPU-only algorithms, and in particular, for the dense Cholesky factorization. In contrast to currently promoted designs that solve parallelism challenges on multicore architectures by representing algorithms as Directed Acyclic Graphs (DAGs), where nodes are tasks of fine granularity and edges are the dependencies between the tasks, our designs explicitly target manycore architectures like GPUs and feature coarse granularity tasks (that can be hierarchically split into fine grain data-parallel subtasks). Furthermore, in contrast to hybrid algorithms that schedule difficult to parallelize tasks on CPUs, we develop highly-efficient code for entirely GPU execution. GPU-only codes remove the expensive CPU-to-GPU communications and the tuning challenges related to slow CPU and/or low CPU-to-GPU bandwidth. We show that on latest GPUs, like the P100, this becomes so important that the GPU-only code even outperforms the hybrid MAGMA algorithms when the CPU tasks and communications can not be entirely overlapped with GPU computations. Weachieve up to 4,300 GFlop/s in double precision on a P100 GPU, which is about 7-8x faster than high-end multicore CPUs, e.g., two 10-cores Intel Xeon E5-2650 v3 Haswell CPUs, where MKL runs up to about 500-600 Gflop/s. The new algorithm also outperforms significantly the GPU-only implementation currently available in the NVIDIA cuSOLVER library.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the General Purpose GPUs","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3038228.3038237","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 15

Abstract

We present our performance analysis, algorithm designs, and the optimizations needed for the development of high-performance GPU-only algorithms, and in particular, for the dense Cholesky factorization. In contrast to currently promoted designs that solve parallelism challenges on multicore architectures by representing algorithms as Directed Acyclic Graphs (DAGs), where nodes are tasks of fine granularity and edges are the dependencies between the tasks, our designs explicitly target manycore architectures like GPUs and feature coarse-granularity tasks (that can be hierarchically split into fine-grain data-parallel subtasks). Furthermore, in contrast to hybrid algorithms that schedule difficult-to-parallelize tasks on CPUs, we develop highly efficient code for entirely GPU execution. GPU-only codes remove the expensive CPU-to-GPU communications and the tuning challenges related to slow CPUs and/or low CPU-to-GPU bandwidth. We show that on the latest GPUs, like the P100, this becomes so important that the GPU-only code even outperforms the hybrid MAGMA algorithms when the CPU tasks and communications cannot be entirely overlapped with GPU computations. We achieve up to 4,300 GFlop/s in double precision on a P100 GPU, which is about 7-8x faster than high-end multicore CPUs, e.g., two 10-core Intel Xeon E5-2650 v3 Haswell CPUs, where MKL runs up to about 500-600 GFlop/s. The new algorithm also significantly outperforms the GPU-only implementation currently available in the NVIDIA cuSOLVER library.
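The paper's kernels are custom MAGMA routines tuned for the P100, which are not reproduced here. As a rough illustration of the GPU-only approach the abstract describes, the following is a minimal sketch of a blocked, right-looking double-precision Cholesky factorization in which every step (diagonal-block factorization, panel triangular solve, trailing-matrix update) stays on the GPU via cuSOLVER and cuBLAS, so no panel is ever shipped back to the CPU. The function name, block size `nb`, handle setup, and omitted error checking are illustrative assumptions, not the authors' implementation.

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusolverDn.h>

/* Hypothetical sketch: factorize the lower triangle of an n x n column-major
 * matrix dA (device pointer, leading dimension lda) in place, A = L * L^T.
 * All work runs on the GPU; nb is a tunable block size (the paper tunes such
 * parameters per GPU generation). Error handling is omitted for brevity. */
void gpu_only_dpotrf(cusolverDnHandle_t solver, cublasHandle_t blas,
                     int n, double *dA, int lda, int nb)
{
    /* Workspace for the diagonal-block factorizations. */
    int lwork = 0;
    cusolverDnDpotrf_bufferSize(solver, CUBLAS_FILL_MODE_LOWER, nb, dA, lda, &lwork);
    double *dWork;  cudaMalloc(&dWork, sizeof(double) * lwork);
    int    *dInfo;  cudaMalloc(&dInfo, sizeof(int));

    const double one = 1.0, neg_one = -1.0;   /* host scalars (default pointer mode) */

    for (int j = 0; j < n; j += nb) {
        int jb = (nb < n - j) ? nb : n - j;

        /* 1. Factorize the jb x jb diagonal block: A(j,j) = L(j,j) L(j,j)^T. */
        cusolverDnDpotrf(solver, CUBLAS_FILL_MODE_LOWER, jb,
                         dA + j + (size_t)j * lda, lda, dWork, lwork, dInfo);

        if (j + jb < n) {
            int m = n - j - jb;

            /* 2. Panel: A(j+jb:n, j) <- A(j+jb:n, j) * L(j,j)^{-T}. */
            cublasDtrsm(blas, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                        CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT, m, jb, &one,
                        dA + j + (size_t)j * lda, lda,
                        dA + (j + jb) + (size_t)j * lda, lda);

            /* 3. Trailing update: A(j+jb:, j+jb:) -= L(j+jb:, j) L(j+jb:, j)^T. */
            cublasDsyrk(blas, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, m, jb,
                        &neg_one, dA + (j + jb) + (size_t)j * lda, lda,
                        &one, dA + (j + jb) + (size_t)(j + jb) * lda, lda);
        }
    }
    cudaFree(dWork);
    cudaFree(dInfo);
}
```

For contrast, a hybrid CPU-GPU scheme of the kind the abstract compares against would typically factorize the diagonal block on the CPU and copy it back, incurring exactly the CPU-to-GPU transfers and overlap-tuning issues that the GPU-only design eliminates.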