Optimization for performance and energy for batched matrix computations on GPUs

A. Haidar, Tingxing Dong, P. Luszczek, S. Tomov, J. Dongarra
{"title":"Optimization for performance and energy for batched matrix computations on GPUs","authors":"A. Haidar, Tingxing Dong, P. Luszczek, S. Tomov, J. Dongarra","doi":"10.1145/2716282.2716288","DOIUrl":null,"url":null,"abstract":"As modern hardware keeps evolving, an increasingly effective approach to develop energy efficient and high-performance solvers is to design them to work on many small size independent problems. Many applications already need this functionality, especially for GPUs, which are known to be currently about four to five times more energy efficient than multicore CPUs. We describe the development of the main one-sided factorizations that work for a set of small dense matrices in parallel, and we illustrate our techniques on the LU and Cholesky factorizations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. The goal of avoiding multicore CPU use, e.g., as in the hybrid CPU-GPU algorithms, is to exclusively benefit from the GPU's significantly higher energy efficiency, as well as from the removal of the costly CPU-to-GPU communications. Furthermore, we do not use a single symmetric multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis and the use of profiling and tracing tools guided the development and optimization of batched factorizations to achieve up to 2-fold speedup and $3$-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to a batched LU factorization featured in the CUBLAS library for GPUs, we achieved up to 2.5 speedup on the K40 GPU.","PeriodicalId":432610,"journal":{"name":"Proceedings of the 8th Workshop on General Purpose Processing using GPUs","volume":"69 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"17","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 8th Workshop on General Purpose Processing using GPUs","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2716282.2716288","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 17

Abstract

As modern hardware keeps evolving, an increasingly effective approach to developing energy-efficient, high-performance solvers is to design them to work on many small, independent problems. Many applications already need this functionality, especially on GPUs, which are currently about four to five times more energy efficient than multicore CPUs. We describe the development of the main one-sided factorizations that work on a set of small dense matrices in parallel, and we illustrate our techniques on the LU and Cholesky factorizations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. By avoiding multicore CPU use, as in hybrid CPU-GPU algorithms, we benefit exclusively from the GPU's significantly higher energy efficiency and remove the costly CPU-to-GPU communication. Furthermore, we do not use a single symmetric multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis and the use of profiling and tracing tools guided the development and optimization of batched factorizations to achieve up to a 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to the batched LU factorization featured in the cuBLAS library for GPUs, we achieved up to a 2.5-fold speedup on the K40 GPU.
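For context on the cuBLAS baseline mentioned in the last sentence, the sketch below shows how a batch of small LU factorizations is launched with a single call to the batched routine cublasDgetrfBatched. This is a minimal illustration of that baseline API, not the paper's own batched-BLAS implementation; the matrix size N, batch size BATCH, and the diagonally dominant test data are illustrative assumptions, not values from the paper.

```c
// Minimal sketch: batched LU factorization via cuBLAS (the GPU baseline
// the paper compares against). Build with: nvcc demo.cu -lcublas
// N and BATCH are illustrative choices, not taken from the paper.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int N = 32;        // each matrix is N x N ("small" problem size)
    const int BATCH = 1000;  // number of independent problems in the batch

    cublasHandle_t handle;
    cublasCreate(&handle);

    // One contiguous slab holds all BATCH matrices in column-major order.
    double *dA;
    cudaMalloc(&dA, sizeof(double) * N * N * BATCH);

    // Fill with diagonally dominant test matrices so the factorization
    // is well conditioned (assumed test data, for illustration only).
    std::vector<double> hA((size_t)N * N * BATCH);
    for (int b = 0; b < BATCH; ++b)
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                hA[(size_t)b * N * N + (size_t)j * N + i] =
                    (i == j) ? 2.0 * N : 1.0;
    cudaMemcpy(dA, hA.data(), sizeof(double) * hA.size(),
               cudaMemcpyHostToDevice);

    // The batched API takes an array of device pointers, one per matrix.
    std::vector<double*> hAptrs(BATCH);
    for (int b = 0; b < BATCH; ++b)
        hAptrs[b] = dA + (size_t)b * N * N;
    double **dAptrs;
    cudaMalloc(&dAptrs, sizeof(double*) * BATCH);
    cudaMemcpy(dAptrs, hAptrs.data(), sizeof(double*) * BATCH,
               cudaMemcpyHostToDevice);

    int *dPiv, *dInfo;
    cudaMalloc(&dPiv, sizeof(int) * N * BATCH); // pivot indices per matrix
    cudaMalloc(&dInfo, sizeof(int) * BATCH);    // per-matrix status codes

    // A single call factorizes all BATCH matrices in place on the GPU;
    // no CPU work and no CPU-to-GPU traffic during the factorization.
    cublasDgetrfBatched(handle, N, dAptrs, N, dPiv, dInfo, BATCH);
    cudaDeviceSynchronize();

    // Sanity-check the status of the first problem in the batch.
    int info0;
    cudaMemcpy(&info0, dInfo, sizeof(int), cudaMemcpyDeviceToHost);
    printf("matrix 0: info = %d (0 means success)\n", info0);

    cudaFree(dPiv); cudaFree(dInfo); cudaFree(dAptrs); cudaFree(dA);
    cublasDestroy(handle);
    return 0;
}
```

The pointer-array calling convention is what makes the batch a single kernel launch sequence rather than BATCH separate launches; the paper's approach goes further by composing entire factorizations from such batched BLAS building blocks so that no single GPU multiprocessor is tied to one problem at a time.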