Solution of Large Sparse System of Linear Equations over GF(2) on a Multi Node Multi GPU Platform

Pub Date: 2022-12-06 | DOI: 10.14429/dsj.72.17656
S. Rawal, Indivar Gupta
{"title":"Solution of Large Sparse System of Linear Equations over GF(2) on a Multi Node Multi GPU Platform","authors":"S. Rawal, Indivar Gupta","doi":"10.14429/dsj.72.17656","DOIUrl":null,"url":null,"abstract":"We provide an efficient multi-node, multi-GPU implementation of the Block Wiedemann Algorithm (BWA)to find the solution of a large sparse system of linear equations over GF(2). One of the important applications ofsolving such systems arises in most integer factorization algorithms like Number Field Sieve. In this paper, wedescribe how hybrid parallelization can be adapted to speed up the most time-consuming sequence generation stage of BWA. This stage involves generating a sequence of matrix-matrix products and matrix transpose-matrix products where the matrices are very large, highly sparse, and have entries over GF(2). We describe a GPU-accelerated parallel method for the computation of these matrix-matrix products using techniques like row-wise parallel distribution of the first matrix over multi-node multi-GPU platform using MPI and CUDA and word-wise XORing of rows of the second matrix. We also describe the hybrid parallelization of matrix transpose-matrix product computation, where we divide both the matrices row-wise into equal-sized blocks using MPI. Then after a GPU-accelerated matrix transpose-matrix product generation, we combine all those blocks using MPI_BXOR operation in MPI_Reduce to obtain the result. The performance of hybrid parallelization of the sequence generation step on a hybrid cluster using multiple GPUs has been compared with parallelization on only multiple MPI processors. We have used this hybrid parallel sequence generation tool for the benchmarking of an HPC cluster. Detailed timings of the complete solution of number field sieve matrices of RSA-130, RSA-140, and RSA-170 are also compared in this paper using up to 4 NVidia V100 GPUs of a DGX station. We got a speedup of 2.8 after parallelization on 4 V100 GPUs compared to that over 1 GPU.","PeriodicalId":0,"journal":{"name":"","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.14429/dsj.72.17656","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

We provide an efficient multi-node, multi-GPU implementation of the Block Wiedemann Algorithm (BWA) to find the solution of a large sparse system of linear equations over GF(2). One of the important applications of solving such systems arises in most integer factorization algorithms, such as the Number Field Sieve. In this paper, we describe how hybrid parallelization can be adapted to speed up the most time-consuming sequence generation stage of BWA. This stage involves generating a sequence of matrix-matrix products and matrix transpose-matrix products where the matrices are very large, highly sparse, and have entries over GF(2). We describe a GPU-accelerated parallel method for the computation of these matrix-matrix products using techniques such as row-wise parallel distribution of the first matrix over a multi-node, multi-GPU platform with MPI and CUDA, and word-wise XORing of rows of the second matrix. We also describe the hybrid parallelization of the matrix transpose-matrix product computation, where we divide both matrices row-wise into equal-sized blocks using MPI. After a GPU-accelerated matrix transpose-matrix product generation, we combine all of these blocks using the MPI_BXOR operation in MPI_Reduce to obtain the result. The performance of hybrid parallelization of the sequence generation step on a hybrid cluster using multiple GPUs is compared with parallelization on multiple MPI processors alone. We have used this hybrid parallel sequence generation tool for benchmarking an HPC cluster. Detailed timings of the complete solution of Number Field Sieve matrices of RSA-130, RSA-140, and RSA-170 are also compared using up to 4 NVIDIA V100 GPUs of a DGX Station. We obtained a speedup of 2.8 on 4 V100 GPUs compared to 1 GPU.
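
The abstract describes the GF(2) matrix-matrix product as word-wise XORing of rows of the second matrix. The following is a minimal CUDA sketch of that idea, not the authors' code: it assumes the sparse matrix is stored in CSR form (only column indices, since entries are in GF(2)) and the second matrix is bit-packed 64 columns per 64-bit word; the kernel name `spmm_gf2` and the layout are illustrative assumptions.

```cuda
// Minimal sketch (not the paper's implementation): one thread block per row of the
// sparse matrix A; the dense matrix B is packed 64 GF(2) columns per uint64_t word.
// Multiplying a row of A by B reduces to XOR-accumulating the rows of B selected by
// A's nonzero column indices, since addition over GF(2) is XOR.
#include <cstdint>

__global__ void spmm_gf2(const int *row_ptr,      // CSR row pointers of A (n_rows + 1 entries)
                         const int *col_idx,      // CSR column indices of A
                         const uint64_t *B,       // B, row-major, words_per_row words per row
                         uint64_t *C,             // result, same packing as B
                         int n_rows,
                         int words_per_row)
{
    int row = blockIdx.x;                         // one block handles one row of A
    if (row >= n_rows) return;

    // Each thread owns a strided subset of the 64-bit words of the output row.
    for (int w = threadIdx.x; w < words_per_row; w += blockDim.x) {
        uint64_t acc = 0;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k) {
            // Word-wise XOR of the selected row of B into the accumulator.
            acc ^= B[(size_t)col_idx[k] * words_per_row + w];
        }
        C[(size_t)row * words_per_row + w] = acc;
    }
}
```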
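For the matrix transpose-matrix product, the abstract states that row blocks computed on the GPUs are combined with the MPI_BXOR operation in MPI_Reduce. Below is a minimal host-side sketch of that combine step under assumed names and a bit-packed layout; `combine_partials`, `local_words`, and the buffer arrangement are hypothetical and only illustrate how MPI_BXOR realizes the GF(2) sum of per-rank partial products.

```cuda
// Minimal host-side sketch (assumed layout, not the authors' exact code): each MPI
// rank forms its local transpose-product on the GPU, copies the bit-packed partial
// result to host memory, and the partials are combined across ranks with XOR,
// which is exactly addition over GF(2).
#include <cuda_runtime.h>
#include <mpi.h>
#include <cstdint>
#include <vector>

// local_words: number of 64-bit words in the (small, dense) partial product.
void combine_partials(const uint64_t *d_partial,  // device pointer to local partial product
                      uint64_t *h_result,         // host buffer for the reduced result on rank 0
                      size_t local_words)
{
    std::vector<uint64_t> h_partial(local_words);
    cudaMemcpy(h_partial.data(), d_partial,
               local_words * sizeof(uint64_t), cudaMemcpyDeviceToHost);

    // MPI_BXOR on 64-bit words gives the GF(2) sum of all row-block partials.
    MPI_Reduce(h_partial.data(), h_result, (int)local_words,
               MPI_UINT64_T, MPI_BXOR, 0, MPI_COMM_WORLD);
}
```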