Fixed Point Lanczos: Sustaining TFLOP-equivalent Performance in FPGAs for Scientific Computing

2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines. Pub Date: 2012-04-29. DOI: 10.1109/FCCM.2012.19

J. Jerez, G. Constantinides, E. Kerrigan


Citations: 10

Abstract

We consider the problem of enabling fixed-point implementations of linear algebra kernels to match the strengths of the field-programmable gate array (FPGA). Algorithms for solving linear equations, finding eigenvalues or finding singular values are typically nonlinear and recursive, which makes establishing analytical bounds on variable dynamic range non-trivial. Current approaches fail to provide tight bounds for this type of algorithm. As a case study we use one of the most important kernels in scientific computing, the Lanczos iteration, which lies at the heart of well-known methods such as conjugate gradient and minimum residual. We show how the algorithm can be modified so that standard linear algebra analysis proves tight analytical bounds on all variables of the process, regardless of the properties of the original matrix. It is shown that the numerical behaviour of fixed-point implementations of the modified problem can be chosen to be at least as good as that of a double-precision floating-point implementation. Using this approach, it is possible when solving a single problem to obtain sustained performance in FPGAs of comparable size that is very close to the peak performance of a general-purpose graphics processing unit (GPGPU). If there are several independent problems to solve simultaneously, it is possible to exceed the peak floating-point performance of a GPGPU, obtaining approximately 1, 2 or 4 TFLOPs for error tolerances of 10^-7, 10^-5 and 10^-3, respectively, in a large Virtex 7 FPGA.
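The abstract does not say how the algorithm is modified, so the following is only an illustrative sketch of the general idea. It assumes a symmetric diagonal scaling, D^{-1/2} A D^{-1/2} with D_kk equal to the absolute row sum of A, which forces the spectral radius of the scaled matrix to be at most 1. With that assumption, the Lanczos vectors (unit norm) and the coefficients alpha and beta all stay within [-1, 1] in exact arithmetic, whatever the magnitude of the original entries. The function names `scale_symmetric` and `lanczos` and the test matrix are hypothetical, and the arithmetic here is ordinary floating point, used only to observe the ranges.

```python
import math


def scale_symmetric(A):
    """Return D^{-1/2} A D^{-1/2} with D_kk = sum_j |A_kj|.

    Assumed scaling (not taken from the abstract). Requires a symmetric A
    with no all-zero row. The result has spectral radius <= 1.
    """
    n = len(A)
    d = [sum(abs(x) for x in row) for row in A]
    return [[A[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)]
            for i in range(n)]


def lanczos(A, q1, steps):
    """Plain Lanczos iteration on a symmetric matrix A (list of lists).

    Returns (alphas, betas, max_abs), where max_abs is the largest
    magnitude seen among alpha, beta and the entries of the Lanczos vectors.
    """
    n = len(A)
    norm = math.sqrt(sum(x * x for x in q1))
    q = [x / norm for x in q1]      # current Lanczos vector, unit norm
    q_prev = [0.0] * n              # previous Lanczos vector
    beta = 0.0
    alphas, betas, max_abs = [], [], 0.0
    for _ in range(steps):
        # w = A q
        w = [sum(A[i][j] * q[j] for j in range(n)) for i in range(n)]
        alpha = sum(wi * qi for wi, qi in zip(w, q))
        # Three-term recurrence: remove components along q and q_prev.
        w = [wi - alpha * qi - beta * pi
             for wi, qi, pi in zip(w, q, q_prev)]
        beta = math.sqrt(sum(x * x for x in w))
        alphas.append(alpha)
        betas.append(beta)
        max_abs = max(max_abs, abs(alpha), beta, max(abs(x) for x in q))
        if beta < 1e-12:            # invariant subspace found, stop early
            break
        q_prev, q = q, [x / beta for x in w]
    return alphas, betas, max_abs


# Hypothetical badly scaled symmetric matrix (entries span ~6 decades).
A = [[4000.0, -1000.0, 0.0, 500.0],
     [-1000.0, 3000.0, 200.0, 0.0],
     [0.0, 200.0, 0.002, -0.001],
     [500.0, 0.0, -0.001, 0.005]]
q1 = [1.0, 1.0, 1.0, 1.0]

_, _, raw_range = lanczos(A, q1, len(A))
_, _, scaled_range = lanczos(scale_symmetric(A), q1, len(A))
print("largest magnitude, original matrix:", raw_range)
print("largest magnitude, scaled matrix:  ", scaled_range)
```

Under this assumed scaling, the largest magnitude for the scaled problem stays at or below 1 (up to rounding), while the unscaled run produces values in the thousands. A known, matrix-independent range of this kind is what allows a fixed-point format to be chosen ahead of time, with the word length then set by the required error tolerance.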
