Antoniette Mondigo, Tomohiro Ueno, Daichi Tanaka, K. Sano, S. Yamamoto
Design and scalability analysis of bandwidth-compressed stream computing with multiple FPGAs
2017 12th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), published 2017-07-01
DOI: 10.1109/ReCoSoC.2017.8016148 (https://doi.org/10.1109/ReCoSoC.2017.8016148)
Citations: 10
Abstract
Stream computing on field-programmable gate arrays (FPGAs) is a promising way to deliver the performance and energy efficiency required by compute-intensive applications such as numerical simulations. The inherent structure and customizability of FPGAs make them a natural choice for highly scalable computing designs. This paper presents a scalable custom computing approach that exploits temporal parallelism by increasing the depth of a computing pipeline across a 1D ring of cascaded FPGAs connected by high-speed, low-latency communication links. Spatial parallelism is also explored by replicating the computing core inside the FPGAs to further increase throughput. Because communication bandwidth is limited, a hardware-based lossless bandwidth compression scheme is used to alleviate this bottleneck and transfer more data streams. A performance model is presented for the scalability analysis and performance estimation of this approach. For evaluation and verification, an actual numerical simulation was implemented on an Intel Arria 10 FPGA with spatially parallel computing cores. Initial results show that the measured performance is close to the values predicted by the performance model. It was also demonstrated that the 1D ring topology of multiple FPGAs with bandwidth-compressed links can scale performance when a sufficiently large data set is computed, even with a deeper pipeline and insufficient inter-FPGA bandwidth.
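The scaling argument in the abstract can be illustrated with a toy throughput model. The sketch below is not the performance model from the paper. It is a simplified stand-in that assumes the stream rate is the lower of the compute rate and the compressed link rate, and that pipeline fill time grows linearly with total depth. All names (`RingConfig`, `cell_rate`, `updates_per_second`) and all parameter values are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RingConfig:
    """Hypothetical parameters; none of these values come from the paper."""
    num_fpgas: int             # FPGAs cascaded in the 1D ring
    stages_per_fpga: int       # cascaded time-step stages per FPGA (temporal parallelism)
    lanes: int                 # replicated cores per stage (spatial parallelism)
    clock_hz: float            # pipeline clock frequency
    bytes_per_cell: int        # uncompressed bytes streamed per grid cell
    link_bytes_per_s: float    # raw inter-FPGA link bandwidth
    compression_ratio: float   # uncompressed size / compressed size, >= 1.0
    stage_latency_cycles: int  # assumed fill latency of one stage, in cycles


def cell_rate(cfg: RingConfig) -> float:
    """Cells per second the ring can stream: the compute or link limit, whichever is lower."""
    compute_rate = cfg.lanes * cfg.clock_hz
    # Lossless compression lets the same link carry more cells per second.
    link_rate = cfg.link_bytes_per_s * cfg.compression_ratio / cfg.bytes_per_cell
    return min(compute_rate, link_rate)


def updates_per_second(cfg: RingConfig, num_cells: int) -> float:
    """Cell updates per second for one pass of num_cells through the whole ring."""
    depth = cfg.num_fpgas * cfg.stages_per_fpga  # time steps advanced per pass
    stream_time = num_cells / cell_rate(cfg)
    fill_time = depth * cfg.stage_latency_cycles / cfg.clock_hz
    return num_cells * depth / (stream_time + fill_time)


if __name__ == "__main__":
    base = RingConfig(num_fpgas=1, stages_per_fpga=4, lanes=2, clock_hz=200e6,
                      bytes_per_cell=16, link_bytes_per_s=4e9,
                      compression_ratio=1.0, stage_latency_cycles=1000)
    for cells in (10**3, 10**8):
        for n in (1, 2, 4):
            cfg = replace(base, num_fpgas=n)
            print(f"cells={cells:>9} fpgas={n} updates/s={updates_per_second(cfg, cells):.3e}")
```

Under these assumptions, a link-bound configuration still gains from a deeper pipeline, because each pass over the data advances more time steps. The gain approaches the number of FPGAs only when the data set is large enough for streaming time to dominate fill time.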