А Gpu-based Orthogonal Matrix Factorization Algorithm that Produces a Two-Diagonal Shape

G. Malaschonok, Serhii Sukharskyi
{"title":"А基于gpu的正交矩阵分解算法产生双对角线形状","authors":"G. Malaschonok, Serhii Sukharskyi","doi":"10.18523/2617-3808.2021.4.10-15","DOIUrl":null,"url":null,"abstract":"\n \n \nWith the development of the Big Data sphere, as well as those fields of study that we can relate to artificial intelligence, the need for fast and efficient computing has become one of the most important tasks nowadays. That is why in the recent decade, graphics processing unit computations have been actively developing to provide an ability for scientists and developers to use thousands of cores GPUs have in order to perform intensive computations. The goal of this research is to implement orthogonal decomposition of a matrix by applying a series of Householder transformations in Java language using JCuda library to conduct a research on its benefits. Several related papers were examined. Malaschonok and Savchenko in their work have introduced an improved version of QR algorithm for this purpose [4] and achieved better results, however Householder algorithm is more promising for GPUs according to another team of researchers – Lahabar and Narayanan [6]. However, they were using Float numbers, while we are using Double, and apart from that we are working on a new BigDecimal type for CUDA. Apart from that, there is still no solution for handling huge matrices where errors in calculations might occur. \nThe algorithm of orthogonal matrix decomposition, which is the first part of SVD algorithm, is researched and implemented in this work. The implementation of matrix bidiagonalization and calculation of orthogonal factors by the Hausholder method in the jCUDA environment on a graphics processor is presented, and the algorithm for the central processor for comparisons is also implemented. Research of the received results where we experimentally measured acceleration of calculations with the use of the graphic processor in comparison with the implementation on the central processor are carried out. We show a speedup up to 53 times compared to CPU implementation on a big matrix size, specifically 2048, and even better results when using more advanced GPUs. At the same time, we still experience bigger errors in calculations while using graphic processing units due to synchronization problems. We compared execution on different platforms (Windows 10 and Arch Linux) and discovered that they are almost the same, taking the computation speed into account. The results have shown that on GPU we can achieve better performance, however there are more implementation difficulties with this approach. \n \n \n","PeriodicalId":433538,"journal":{"name":"NaUKMA Research Papers. Computer Science","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"А Gpu-based Orthogonal Matrix Factorization Algorithm that Produces a Two-Diagonal Shape\",\"authors\":\"G. Malaschonok, Serhii Sukharskyi\",\"doi\":\"10.18523/2617-3808.2021.4.10-15\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"\\n \\n \\nWith the development of the Big Data sphere, as well as those fields of study that we can relate to artificial intelligence, the need for fast and efficient computing has become one of the most important tasks nowadays. 
That is why in the recent decade, graphics processing unit computations have been actively developing to provide an ability for scientists and developers to use thousands of cores GPUs have in order to perform intensive computations. The goal of this research is to implement orthogonal decomposition of a matrix by applying a series of Householder transformations in Java language using JCuda library to conduct a research on its benefits. Several related papers were examined. Malaschonok and Savchenko in their work have introduced an improved version of QR algorithm for this purpose [4] and achieved better results, however Householder algorithm is more promising for GPUs according to another team of researchers – Lahabar and Narayanan [6]. However, they were using Float numbers, while we are using Double, and apart from that we are working on a new BigDecimal type for CUDA. Apart from that, there is still no solution for handling huge matrices where errors in calculations might occur. \\nThe algorithm of orthogonal matrix decomposition, which is the first part of SVD algorithm, is researched and implemented in this work. The implementation of matrix bidiagonalization and calculation of orthogonal factors by the Hausholder method in the jCUDA environment on a graphics processor is presented, and the algorithm for the central processor for comparisons is also implemented. Research of the received results where we experimentally measured acceleration of calculations with the use of the graphic processor in comparison with the implementation on the central processor are carried out. We show a speedup up to 53 times compared to CPU implementation on a big matrix size, specifically 2048, and even better results when using more advanced GPUs. At the same time, we still experience bigger errors in calculations while using graphic processing units due to synchronization problems. We compared execution on different platforms (Windows 10 and Arch Linux) and discovered that they are almost the same, taking the computation speed into account. The results have shown that on GPU we can achieve better performance, however there are more implementation difficulties with this approach. \\n \\n \\n\",\"PeriodicalId\":433538,\"journal\":{\"name\":\"NaUKMA Research Papers. Computer Science\",\"volume\":\"6 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-12-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"NaUKMA Research Papers. Computer Science\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.18523/2617-3808.2021.4.10-15\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"NaUKMA Research Papers. Computer Science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18523/2617-3808.2021.4.10-15","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

With the development of Big Data and of the fields of study related to artificial intelligence, the need for fast and efficient computing has become one of the most important challenges today. That is why, over the recent decade, graphics processing unit (GPU) computing has been actively developing, giving scientists and developers the ability to use the thousands of cores a GPU provides to perform intensive computations. The goal of this research is to implement the orthogonal decomposition of a matrix by applying a series of Householder transformations in Java using the JCuda library, and to study the benefits of this approach.

Several related papers were examined. Malaschonok and Savchenko introduced an improved version of the QR algorithm for this purpose [4] and achieved better results; however, according to another team of researchers, Lahabar and Narayanan [6], the Householder algorithm is more promising for GPUs. They used single-precision Float numbers, while we use Double, and in addition we are working on a new BigDecimal type for CUDA. Apart from that, there is still no solution for handling huge matrices in which calculation errors might occur.

The algorithm of orthogonal matrix decomposition, which is the first part of the SVD algorithm, is researched and implemented in this work. We present an implementation of matrix bidiagonalization and of the calculation of the orthogonal factors by the Householder method in the jCUDA environment on a graphics processor, together with a CPU implementation of the same algorithm for comparison. We experimentally measured the speedup of the computation on the graphics processor relative to the implementation on the central processor. We show a speedup of up to 53 times compared to the CPU implementation on a large matrix size, specifically 2048, and even better results when using more advanced GPUs. At the same time, we still observe larger calculation errors on graphics processing units due to synchronization problems. We compared execution on different platforms (Windows 10 and Arch Linux) and found that, in terms of computation speed, they are almost the same. The results show that better performance can be achieved on the GPU, although this approach involves more implementation difficulties.
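To make the bidiagonalization step concrete, below is a minimal CPU-side sketch in Java of Golub-Kahan bidiagonalization with Householder reflections. It is an illustrative assumption, not the authors' JCuda implementation: the class and method names are hypothetical, the matrix is assumed square and stored as double[][], and the orthogonal factors U and V are not accumulated.

```java
// A minimal sketch of Householder bidiagonalization (B = U^T A V, upper
// bidiagonal), CPU only, double precision. Illustrative names; not the
// authors' GPU code.
public final class HouseholderBidiagonalization {

    /** Reduces the n x n matrix a in place to upper bidiagonal form. */
    public static void bidiagonalize(double[][] a) {
        int n = a.length;
        for (int k = 0; k < n; k++) {
            // Left reflection: zero the entries below the diagonal in column k.
            applyLeftReflector(a, k);
            // Right reflection: zero the entries right of the superdiagonal in row k.
            if (k < n - 2) {
                applyRightReflector(a, k);
            }
        }
    }

    // Builds the Householder vector for column k and applies
    // H = I - 2 v v^T / (v^T v) to the trailing submatrix from the left.
    private static void applyLeftReflector(double[][] a, int k) {
        int n = a.length;
        double norm = 0.0;
        for (int i = k; i < n; i++) norm += a[i][k] * a[i][k];
        norm = Math.sqrt(norm);
        if (norm == 0.0) return;
        // Choose the sign of alpha that avoids cancellation.
        double alpha = a[k][k] >= 0 ? -norm : norm;
        double[] v = new double[n];
        for (int i = k; i < n; i++) v[i] = a[i][k];
        v[k] -= alpha;
        double vtv = 0.0;
        for (int i = k; i < n; i++) vtv += v[i] * v[i];
        if (vtv == 0.0) return;
        for (int j = k; j < n; j++) {
            double dot = 0.0;
            for (int i = k; i < n; i++) dot += v[i] * a[i][j];
            double scale = 2.0 * dot / vtv;
            for (int i = k; i < n; i++) a[i][j] -= scale * v[i];
        }
    }

    // Same idea applied to row k from the right, zeroing columns k+2..n-1.
    private static void applyRightReflector(double[][] a, int k) {
        int n = a.length;
        double norm = 0.0;
        for (int j = k + 1; j < n; j++) norm += a[k][j] * a[k][j];
        norm = Math.sqrt(norm);
        if (norm == 0.0) return;
        double alpha = a[k][k + 1] >= 0 ? -norm : norm;
        double[] v = new double[n];
        for (int j = k + 1; j < n; j++) v[j] = a[k][j];
        v[k + 1] -= alpha;
        double vtv = 0.0;
        for (int j = k + 1; j < n; j++) vtv += v[j] * v[j];
        if (vtv == 0.0) return;
        for (int i = k; i < n; i++) {
            double dot = 0.0;
            for (int j = k + 1; j < n; j++) dot += a[i][j] * v[j];
            double scale = 2.0 * dot / vtv;
            for (int j = k + 1; j < n; j++) a[i][j] -= scale * v[j];
        }
    }
}
```

A quick sanity check: after bidiagonalize(a), every entry of a outside the main diagonal and the first superdiagonal should be close to zero, up to rounding error. On a GPU, the inner update loops over the trailing submatrix are the natural candidates for parallelization, since each column (or row) update is independent of the others.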