OpenCL编程中计算与通信相结合的FPGA并行处理

2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Pub Date : 2019-05-01 DOI:10.1109/IPDPSW.2019.00089

N. Fujita, Ryohei Kobayashi, Y. Yamaguchi, T. Boku

{"title":"OpenCL编程中计算与通信相结合的FPGA并行处理","authors":"N. Fujita, Ryohei Kobayashi, Y. Yamaguchi, T. Boku","doi":"10.1109/IPDPSW.2019.00089","DOIUrl":null,"url":null,"abstract":"In recent years, Field Programmable Gate Array (FPGA) has been a topic of interest in High Performance Computing (HPC) research. Although the biggest problem in utilizing FPGAs for HPC applications is in the difficulty of developing FPGAs, this problem is being solved by High Level Synthesis (HLS). We focus on very high-performance inter-FPGA communication capabilities. The absolute floating-point performance of an FPGA is lower than that of other common accelerators such as GPUs. However, we consider that we can apply FPGAs to a wide variety of HPC applications if we can combine computations and communications on an FPGA. The purpose of this paper is to implement a parallel processing system running applications implemented by HLS combining computations and communications in FPGAs. We propose the Channel over Ethernet (CoE) system that connects multiple FPGAs directly for OpenCL parallel programming. \"Channel\"' is one of the new extensions provided by the Intel OpenCL environment. They are ordinally used for intra-kernel communication inside an FPGA, but we extend them to external communication through the CoE system. In this paper, we introduce two benchmarks as demonstration of the CoE system. We achieved 29.77 Gbps in throughput (approximately 75% of the theoretical peak of 40Gbps) and 950 ns in latency on our system using the pingpong benchmark, which was implemented on Intel Arria10 FPGA. In addition, we evaluated the Himeno benchmark which is a sort of 3D-Computational Fluid Dynamics (CFD) on the system, and we achieved 23689MFLOPS with 4 FPGAs on a problem of size M. We also notice strong scalability, with a 3.93 times speedup compared to a single FPGA run, on the same problem size.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"15 6","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":"{\"title\":\"Parallel Processing on FPGA Combining Computation and Communication in OpenCL Programming\",\"authors\":\"N. Fujita, Ryohei Kobayashi, Y. Yamaguchi, T. Boku\",\"doi\":\"10.1109/IPDPSW.2019.00089\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In recent years, Field Programmable Gate Array (FPGA) has been a topic of interest in High Performance Computing (HPC) research. Although the biggest problem in utilizing FPGAs for HPC applications is in the difficulty of developing FPGAs, this problem is being solved by High Level Synthesis (HLS). We focus on very high-performance inter-FPGA communication capabilities. The absolute floating-point performance of an FPGA is lower than that of other common accelerators such as GPUs. However, we consider that we can apply FPGAs to a wide variety of HPC applications if we can combine computations and communications on an FPGA. The purpose of this paper is to implement a parallel processing system running applications implemented by HLS combining computations and communications in FPGAs. We propose the Channel over Ethernet (CoE) system that connects multiple FPGAs directly for OpenCL parallel programming. \\\"Channel\\\"' is one of the new extensions provided by the Intel OpenCL environment. They are ordinally used for intra-kernel communication inside an FPGA, but we extend them to external communication through the CoE system. In this paper, we introduce two benchmarks as demonstration of the CoE system. We achieved 29.77 Gbps in throughput (approximately 75% of the theoretical peak of 40Gbps) and 950 ns in latency on our system using the pingpong benchmark, which was implemented on Intel Arria10 FPGA. In addition, we evaluated the Himeno benchmark which is a sort of 3D-Computational Fluid Dynamics (CFD) on the system, and we achieved 23689MFLOPS with 4 FPGAs on a problem of size M. We also notice strong scalability, with a 3.93 times speedup compared to a single FPGA run, on the same problem size.\",\"PeriodicalId\":292054,\"journal\":{\"name\":\"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)\",\"volume\":\"15 6\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"14\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPSW.2019.00089\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW.2019.00089","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 14

摘要

近年来，现场可编程门阵列(FPGA)一直是高性能计算(HPC)研究的热点。虽然fpga在高性能计算应用中的最大问题是fpga的开发困难，但这一问题正在通过高级综合(High Level Synthesis, HLS)得到解决。我们专注于非常高性能的fpga间通信能力。FPGA的绝对浮点性能低于gpu等常用加速器。然而，我们认为，如果我们能够在FPGA上结合计算和通信，我们可以将FPGA应用于各种HPC应用。本文的目的是在fpga中结合计算和通信，实现一个运行HLS实现的应用程序的并行处理系统。我们提出了直接连接多个fpga的以太网通道(CoE)系统，用于OpenCL并行编程。“Channel”是Intel OpenCL环境提供的新扩展之一。它们通常用于FPGA内部的内核通信，但是我们通过CoE系统将它们扩展到外部通信。在本文中，我们将引入两个基准测试作为CoE系统的演示。我们在使用乒乓基准测试的系统上实现了29.77 Gbps的吞吐量(大约是理论峰值40Gbps的75%)和950 ns的延迟，该测试是在Intel Arria10 FPGA上实现的。此外，我们在系统上评估了Himeno基准测试，这是一种3d计算流体动力学(CFD)，我们在大小为m的问题上使用4个FPGA实现了23689MFLOPS。我们还注意到强大的可扩展性，在相同的问题大小上，与单个FPGA运行相比，速度提高了3.93倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Parallel Processing on FPGA Combining Computation and Communication in OpenCL Programming

In recent years, Field Programmable Gate Array (FPGA) has been a topic of interest in High Performance Computing (HPC) research. Although the biggest problem in utilizing FPGAs for HPC applications is in the difficulty of developing FPGAs, this problem is being solved by High Level Synthesis (HLS). We focus on very high-performance inter-FPGA communication capabilities. The absolute floating-point performance of an FPGA is lower than that of other common accelerators such as GPUs. However, we consider that we can apply FPGAs to a wide variety of HPC applications if we can combine computations and communications on an FPGA. The purpose of this paper is to implement a parallel processing system running applications implemented by HLS combining computations and communications in FPGAs. We propose the Channel over Ethernet (CoE) system that connects multiple FPGAs directly for OpenCL parallel programming. "Channel"' is one of the new extensions provided by the Intel OpenCL environment. They are ordinally used for intra-kernel communication inside an FPGA, but we extend them to external communication through the CoE system. In this paper, we introduce two benchmarks as demonstration of the CoE system. We achieved 29.77 Gbps in throughput (approximately 75% of the theoretical peak of 40Gbps) and 950 ns in latency on our system using the pingpong benchmark, which was implemented on Intel Arria10 FPGA. In addition, we evaluated the Himeno benchmark which is a sort of 3D-Computational Fluid Dynamics (CFD) on the system, and we achieved 23689MFLOPS with 4 FPGAs on a problem of size M. We also notice strong scalability, with a 3.93 times speedup compared to a single FPGA run, on the same problem size.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

自引率

0.00%

发文量