Application Composition and Communication Optimization in Iterative Solvers Using FPGAs

2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines. Pub Date: 2013-04-28. DOI: 10.1109/FCCM.2013.16

A. Rafique, Nachiket Kapre, G. Constantinides


Citations: 5

Abstract

We consider the problem of minimizing communication with off-chip memory and composing multiple linear algebra kernels in iterative solvers for large-scale eigenvalue problems and linear systems of equations. While GPUs may offer higher throughput for individual kernels, overall application performance is limited by their inability to support on-chip sharing of data across kernels. In this paper, we show that higher on-chip memory capacity and superior on-chip communication bandwidth enable FPGAs to better support the composition of a sequence of kernels within these iterative solvers. We present a time-multiplexed FPGA architecture that exploits the on-chip capacity to store dependencies between kernels and the high communication bandwidth to move data. We propose a resource-constrained framework to select the optimal value of an algorithmic parameter that controls the tradeoff between communication and computation cost for a particular FPGA. Using the Lanczos method as a case study, we show how to minimize communication on FPGAs through this tight algorithm-architecture interaction and obtain superior performance over a GPU despite its ~5x larger off-chip memory bandwidth and ~2x greater peak single-precision floating-point performance.
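To make the "sequence of kernels" concrete, the following is a minimal sketch of the textbook Lanczos iteration in plain Python. It is not the paper's FPGA architecture or its communication-optimized formulation; the function names (`matvec`, `dot`, `lanczos`) and the dense list-of-lists matrix are assumptions chosen for illustration. Each iteration chains a matrix-vector product, a dot product, a vector update, and a norm, and the vectors passed between these steps are the kind of inter-kernel dependencies the abstract describes keeping on chip.

```python
import math

def matvec(A, x):
    # Dense matrix-vector product; a real solver would use a sparse format.
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def lanczos(A, q0, steps):
    """Run up to `steps` Lanczos iterations on symmetric A.

    Returns (alphas, betas): the diagonal and off-diagonal entries of the
    tridiagonal matrix whose eigenvalues approximate those of A.
    """
    norm = math.sqrt(dot(q0, q0))
    q = [v / norm for v in q0]       # current Lanczos vector
    q_prev = [0.0] * len(q)          # previous Lanczos vector
    beta = 0.0
    alphas, betas = [], []
    for _ in range(steps):
        w = matvec(A, q)             # kernel 1: matrix-vector product
        alpha = dot(w, q)            # kernel 2: dot product
        # kernel 3: vector update (orthogonalize against q and q_prev)
        w = [wi - alpha * qi - beta * pi for wi, qi, pi in zip(w, q, q_prev)]
        beta = math.sqrt(dot(w, w))  # kernel 4: norm
        alphas.append(alpha)
        if beta < 1e-12:             # invariant subspace found; stop early
            break
        betas.append(beta)
        q_prev, q = q, [wi / beta for wi in w]
    return alphas, betas

if __name__ == "__main__":
    A = [[2.0, 1.0], [1.0, 2.0]]
    print(lanczos(A, [1.0, 0.0], 2))
```

In this sketch every kernel reads vectors produced by the previous one, so an implementation that writes `w` and `q` back to off-chip memory between kernels pays a memory round trip per step. That per-step traffic is the communication cost the abstract weighs against computation.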
