Solution of Large Sparse System of Linear Equations over GF(2) on a Multi Node Multi GPU Platform

IF 0.8, Region 4 (Engineering and Technology), Q3 MULTIDISCIPLINARY SCIENCES

Defence Science Journal Pub Date : 2022-12-06 DOI:10.14429/dsj.72.17656

S. Rawal, Indivar Gupta


Citations: 0

Abstract

We present an efficient multi-node, multi-GPU implementation of the Block Wiedemann Algorithm (BWA) for solving a large sparse system of linear equations over GF(2). An important application of solving such systems arises in integer factorization algorithms such as the Number Field Sieve. In this paper, we describe how hybrid parallelization can be used to speed up sequence generation, the most time-consuming stage of BWA. This stage generates a sequence of matrix-matrix products and matrix transpose-matrix products in which the matrices are very large, highly sparse, and have entries in GF(2). We describe a GPU-accelerated parallel method for computing the matrix-matrix products: the rows of the first matrix are distributed across a multi-node, multi-GPU platform using MPI and CUDA, and the rows of the second matrix are XORed word-wise. We also describe the hybrid parallelization of the matrix transpose-matrix product, in which both matrices are divided row-wise into equal-sized blocks using MPI. After a GPU-accelerated matrix transpose-matrix product is computed for each block, the partial results are combined using the MPI_BXOR operation in MPI_Reduce to obtain the final result. We compare the performance of hybrid parallelization of the sequence generation step on a hybrid cluster with multiple GPUs against parallelization on multiple MPI processors only. We have also used this hybrid parallel sequence generation tool to benchmark an HPC cluster. Finally, we compare detailed timings for the complete solution of the number field sieve matrices of RSA-130, RSA-140, and RSA-170 using up to 4 NVIDIA V100 GPUs of a DGX station. Parallelization on 4 V100 GPUs gave a speedup of 2.8 over a single GPU.
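The two products named in the abstract can be illustrated with a minimal sketch. It is a plain-Python stand-in, not the authors' CUDA/MPI implementation: the function names (`spmm_gf2`, `transpose_product_gf2`, `xor_reduce`), the storage of the sparse matrix as per-row lists of nonzero column indices, and the packing of each dense row into one Python integer are all assumptions made for illustration. Slicing the row lists plays the role of the row-wise distribution over MPI ranks, and `xor_reduce` mimics what `MPI_Reduce` with `MPI_BXOR` would compute.

```python
from functools import reduce
from operator import xor


def spmm_gf2(sparse_rows, dense_words):
    """Sparse-times-dense product over GF(2).

    sparse_rows[i] lists the column indices of the nonzero entries in row i
    of the first (sparse) matrix. dense_words[j] is row j of the second
    matrix packed into one integer (bit k holds column k).
    """
    result = []
    for cols in sparse_rows:
        acc = 0
        for j in cols:
            acc ^= dense_words[j]  # word-wise XOR of a row of the second matrix
        result.append(acc)
    return result


def transpose_product_gf2(x_words, y_words, width):
    """Compute X^T * Y over GF(2) for packed rows with `width` columns.

    Row a of the result is the XOR of y_words[i] over all i for which
    bit a of x_words[i] is set.
    """
    result = [0] * width
    for x, y in zip(x_words, y_words):
        for a in range(width):
            if (x >> a) & 1:
                result[a] ^= y
    return result


def xor_reduce(partials):
    """Element-wise XOR of per-block partial results (stand-in for MPI_BXOR)."""
    return [reduce(xor, column) for column in zip(*partials)]


if __name__ == "__main__":
    # Hypothetical toy data: a 4x4 sparse matrix and a 4-row, 4-bit dense block.
    B = [[0, 2], [1], [0, 1, 3], [2, 3]]
    V = [0b1010, 0b0110, 0b0001, 0b1111]

    # Row-wise distribution: each "rank" handles a slice of the rows of B.
    W = spmm_gf2(B[:2], V) + spmm_gf2(B[2:], V)

    # Transpose product: each "rank" multiplies its row block, then XOR-reduce.
    partials = [
        transpose_product_gf2(V[:2], W[:2], 4),
        transpose_product_gf2(V[2:], W[2:], 4),
    ]
    print(W, xor_reduce(partials))
```

Because addition in GF(2) is XOR, the row blocks can be processed independently and their partial transpose products combined in any order, which is why a single bitwise-XOR reduction suffices in this sketch.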


Source journal

Defence Science Journal (multidisciplinary journal)

CiteScore: 1.80

Self-citation rate: 11.10%

Review time: 7.5 months

Journal description: Defence Science Journal is a peer-reviewed, multidisciplinary research journal in the area of defence science and technology. The journal features recent progress in defence and military support systems, along with new findings and breakthroughs. Major subject fields covered include aeronautics, armaments, combat vehicles and engineering, biomedical sciences, computer sciences, electronics, material sciences, missiles, and naval systems.