Reducing Memory Requirements for High-Performance and Numerically Stable Gaussian Elimination

D. Boland
DOI: 10.1145/2847263.2847281
Venue: Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
Publication date: 2016-02-21
Citations: 5

Abstract

Gaussian elimination is a well-known technique for computing the solution to a system of linear equations, and boosting its performance is highly desirable. While straightforward parallel techniques are limited either by I/O or by on-chip memory bandwidth, block-based algorithms offer the potential to bridge this gap by interleaving I/O with computation. However, these algorithms require the amount of on-chip memory to be at least the square of the number of processing elements available. Using the latest-generation Altera FPGAs with hardened floating-point units, this is no longer the case. It follows that the amount of on-chip memory limits performance, a problem that is only likely to worsen unless on-chip memory comes to dominate FPGA architecture. In addition to this limitation, existing FPGA implementations of block-based Gaussian elimination sacrifice either numerical stability or efficiency. The former limits the usefulness of these implementations to a small class of matrices; the latter limits their performance. This paper presents a high-performance and numerically stable method for performing Gaussian elimination on an FPGA. The modified algorithm makes use of a deep pipeline to store the matrix and ensures that peak performance is once again limited by the number of floating-point units that can fit on the FPGA. When applied to large matrices, this technique can obtain a sustained performance of up to 256 GFLOPs on an Arria 10, beginning to tap into the full potential of these devices. This performance is comparable to the peak that could be achieved using a simple block-based algorithm, with the performance on a Stratix 10 predicted to be superior. This is in spite of the fact that the underlying algorithm for this paper's implementation, Gaussian elimination with pairwise pivoting, is more complex and applicable to a wider range of practical problems.
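To make the pairwise-pivoting idea concrete, the following Python sketch performs Gaussian elimination using only comparisons between adjacent rows: each elimination step examines two neighbouring rows, keeps the one with the larger leading entry on top, and eliminates the other, so every multiplier is bounded by 1 in magnitude. This is an illustrative software model of the technique, not the paper's FPGA implementation; the function name and structure are assumptions for the example.

```python
import numpy as np

def ge_pairwise_pivot(A, b):
    """Solve Ax = b by Gaussian elimination with pairwise (neighbour) pivoting.

    Unlike partial pivoting, which searches an entire column for the
    largest pivot, each step here compares only the two rows involved,
    which keeps every multiplier bounded by 1 and makes the scheme
    amenable to a deep hardware pipeline.
    """
    A = np.array(A, dtype=float, copy=True)
    b = np.array(b, dtype=float, copy=True)
    n = len(b)
    for k in range(n - 1):
        # Sweep the rows below the diagonal from the bottom up,
        # zeroing column k one neighbouring pair at a time.
        for i in range(n - 1, k, -1):
            if abs(A[i, k]) > abs(A[i - 1, k]):
                # Local swap so the larger entry acts as the pivot.
                A[[i - 1, i]] = A[[i, i - 1]]
                b[[i - 1, i]] = b[[i, i - 1]]
            if A[i - 1, k] != 0.0:
                m = A[i, k] / A[i - 1, k]   # |m| <= 1 by construction
                A[i, k:] -= m * A[i - 1, k:]
                b[i] -= m * b[i - 1]
    # Back-substitution on the resulting upper-triangular system.
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x
```

Bounded multipliers give pairwise pivoting stability in practice close to that of partial pivoting, while the purely local row comparisons avoid the global column search that would stall a pipelined or systolic hardware design.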