Optimization of Numerous Small Dense-Matrix–Vector Multiplications in H-Matrix Arithmetic on GPU

2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC). Pub Date: 2019-10-01. DOI: 10.1109/MCSoC.2019.00009

S. Ohshima, I. Yamazaki, Akihiro Ida, Rio Yokota


Citations: 4

Abstract

Dense-matrix–vector multiplication is one of the best-known and most important matrix calculations. It is provided as the general matrix–vector multiplication (GEMV) function in basic linear algebra subprograms (BLAS) libraries for a variety of computing hardware. Traditionally, studies have focused on multiplying one large dense matrix (one whose sides are long) by a vector. However, some applications require acceleration of numerous small dense-matrix–vector multiplications. This feature is provided by batched BLAS libraries, and the same calculation is needed to compute a hierarchical-matrix–vector multiplication. In this study, we implemented numerous small dense-matrix–vector multiplications on a Pascal GPU and evaluated their performance. We examined the impact of optimization parameters and obtained better performance than previous work: the maximum difference is 28.47% relative to our previous work and up to 81.81% relative to the batched GEMV of MAGMA BLAS. Moreover, we considered the use of two optimization parameters in one GPU kernel, with one parameter applied to some matrices and the second applied to the others. Although the improvement was limited (up to 5%), a performance improvement was achieved. Our results will serve as a good reference for users who need to perform numerous small dense-matrix–vector multiplications on a GPU and want to optimize matrix–vector multiplication by hand-tuning and auto-tuning.
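To make the operation concrete, the following is a minimal CPU reference sketch of a batched GEMV over small matrices of differing sizes, together with a size-based choice between two tuning values. It is not the authors' GPU kernel. The function names, the `threshold` value, and the two candidate values (`small`, `large`) are hypothetical and chosen only for illustration; the abstract does not specify which optimization parameters were tuned or how matrices were assigned to them.

```python
from typing import List, Sequence

Matrix = List[List[float]]  # row-major dense matrix


def gemv(a: Matrix, x: Sequence[float]) -> List[float]:
    """Reference y = A x for one small dense matrix."""
    for row in a:
        if len(row) != len(x):
            raise ValueError("row length must match vector length")
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def batched_gemv(mats: Sequence[Matrix],
                 vecs: Sequence[Sequence[float]]) -> List[List[float]]:
    """Apply gemv to each (matrix, vector) pair; sizes may differ per pair."""
    if len(mats) != len(vecs):
        raise ValueError("need exactly one vector per matrix")
    return [gemv(a, x) for a, x in zip(mats, vecs)]


def pick_parameter(n_cols: int, threshold: int = 32,
                   small: int = 8, large: int = 32) -> int:
    """Hypothetical two-parameter rule (an assumption, not from the paper):
    matrices narrower than `threshold` get `small`, the rest get `large`."""
    return small if n_cols < threshold else large


if __name__ == "__main__":
    mats = [[[1.0, 2.0], [3.0, 4.0]],   # 2 x 2
            [[1.0, 0.0, 2.0]]]          # 1 x 3
    vecs = [[1.0, 1.0], [1.0, 2.0, 3.0]]
    print(batched_gemv(mats, vecs))
    print([pick_parameter(len(m[0])) for m in mats])
```

On a GPU, each matrix in the batch would typically be mapped to its own group of threads, and a rule like `pick_parameter` would decide how that group is configured for a given matrix; the sketch above captures only the arithmetic and the selection logic.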


