A Fast Implementation of Matrix-matrix Product in Double-double Precision on NVIDIA C2050 and Application to Semidefinite Programming

Maho Nakata, Yasuyoshi Takao, S. Noda, R. Himeno
Published in: 2012 Third International Conference on Networking and Computing, 2012-12-05. DOI: 10.1109/ICNC.2012.19
Citations: 11

Abstract

We have implemented a fast double-double precision (approx. 32 significant decimal digits) version of the matrix-matrix multiplication routine "Rgemm" of MPACK (http://mplapack.sourceforge.net/) on the NVIDIA C2050 GPU. This routine is a higher-precision version of "dgemm" in the BLAS (Basic Linear Algebra Subprograms) library. Our implementation is the fastest to date on the NVIDIA C2050 and the most efficient on NVIDIA GPUs: we achieved peak performances of 16.4 GFlops for the kernel (16.1 GFlops with CPU-GPU transfer included), and 26.4 GFlops (25.7 GFlops with CPU-GPU transfer included) by employing lower-accuracy arithmetic. These are 92.3% (90.7%) and 87.1% (84.8%) of the theoretical peak performance of the NVIDIA C2050, and about 150 times faster than the reference implementation on an Intel Xeon X3470. Moreover, our implementation can handle matrices of arbitrary size by employing the "pointer redirecting" technique of Nath et al. We integrated this GPU-accelerated version of Rgemm into the double-double precision semidefinite programming solver SDPA-DD, and its performance improved by up to 14.5 times. This version of Rgemm has been available at http://mplapack.sourceforge.net/ since 2011/10/28.