Design and scalability analysis of bandwidth-compressed stream computing with multiple FPGAs

2017 12th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC). Pub Date: 2017-07-01. DOI: 10.1109/ReCoSoC.2017.8016148

Antoniette Mondigo, Tomohiro Ueno, Daichi Tanaka, K. Sano, S. Yamamoto


Citations: 10

Abstract

Stream computing on field-programmable gate arrays (FPGAs) is a promising way to deliver the performance and energy efficiency that compute-intensive applications such as numerical simulations require. The inherent structure and customizability of FPGAs make them a natural choice for highly scalable computing designs. This paper presents a scalable custom computing approach that exploits temporal parallelism by deepening a computing pipeline across a 1D ring of cascaded FPGAs connected by high-speed, low-latency communication links. Spatial parallelism is also explored by replicating the computing core inside each FPGA to further increase throughput. Because communication bandwidth is limited, a hardware-based lossless bandwidth compression scheme is used to alleviate this bottleneck and transfer more data streams. A performance model is presented for scalability analysis and performance estimation of the approach. For evaluation and verification, an actual numerical simulation was implemented on an Intel Arria 10 FPGA with spatially parallel computing cores. Initial results show that the measured performance is close to the values predicted by the performance model. The results also demonstrate that a 1D ring of multiple FPGAs with bandwidth-compressed links can scale performance when a sufficiently large data set is computed, even with a deeper pipeline and insufficient inter-FPGA bandwidth.
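The interplay the abstract describes (pipeline depth, core replication, link bandwidth, compression, and data set size) can be illustrated with a toy estimate. The sketch below is not the paper's performance model, which the abstract does not spell out. It is a minimal illustration under simple assumptions: every stage processes `lanes` cells per cycle, the stream stalls in proportion to any link bandwidth shortfall, and pipeline fill time grows with total depth. All names and parameter values are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RingConfig:
    """Hypothetical parameters; none of these values come from the paper."""
    n_fpgas: int               # FPGAs cascaded in the 1D ring
    stages_per_fpga: int       # cascaded cores per FPGA (temporal parallelism)
    lanes: int                 # replicated cores per stage (spatial parallelism)
    clock_hz: float            # pipeline clock frequency
    bytes_per_cell: int        # stream payload per grid cell
    flops_per_cell: int        # arithmetic per cell per stage
    link_bytes_per_s: float    # raw inter-FPGA link bandwidth
    compression_ratio: float   # uncompressed size / compressed size (>= 1)
    stage_latency_cycles: int  # cycles for a cell to traverse one stage


def link_utilization(cfg: RingConfig) -> float:
    """Fraction of cycles the pipeline can be fed through an inter-FPGA link."""
    # Bandwidth the uncompressed stream needs to keep every lane busy.
    demand = cfg.lanes * cfg.bytes_per_cell * cfg.clock_hz
    # Assumption: lossless compression multiplies the effective link bandwidth.
    supply = cfg.link_bytes_per_s * cfg.compression_ratio
    return min(1.0, supply / demand)


def sustained_flops(cfg: RingConfig, n_cells: float) -> float:
    """Estimated FLOP/s for one pass of n_cells through the whole ring."""
    depth = cfg.n_fpgas * cfg.stages_per_fpga
    util = link_utilization(cfg)
    # Cycles spent streaming cells, stretched by any bandwidth shortfall.
    stream_cycles = n_cells / (cfg.lanes * util)
    # Cycles before the first result emerges; grows with pipeline depth.
    fill_cycles = depth * cfg.stage_latency_cycles
    seconds = (stream_cycles + fill_cycles) / cfg.clock_hz
    return n_cells * cfg.flops_per_cell * depth / seconds


if __name__ == "__main__":
    base = RingConfig(n_fpgas=4, stages_per_fpga=8, lanes=2, clock_hz=200e6,
                      bytes_per_cell=40, flops_per_cell=100,
                      link_bytes_per_s=8e9, compression_ratio=1.0,
                      stage_latency_cycles=1000)
    compressed = replace(base, compression_ratio=2.0)
    for n_cells in (1e4, 1e6, 1e8):
        print(n_cells, sustained_flops(base, n_cells),
              sustained_flops(compressed, n_cells))
```

In this toy model, a small data set is dominated by pipeline fill time, so added depth yields little benefit. A large data set amortizes the fill time, and compression recovers throughput lost to the link. That is the qualitative behavior the abstract reports, although the paper's own model may differ in form and detail.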

