N. Fujita, Ryohei Kobayashi, Y. Yamaguchi, T. Boku
2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2019-05-01. DOI: 10.1109/IPDPSW.2019.00089 (https://doi.org/10.1109/IPDPSW.2019.00089)
Parallel Processing on FPGA Combining Computation and Communication in OpenCL Programming
In recent years, Field Programmable Gate Arrays (FPGAs) have been a topic of interest in High Performance Computing (HPC) research. The biggest obstacle to using FPGAs for HPC applications has been the difficulty of FPGA development, but this problem is being addressed by High Level Synthesis (HLS). We focus on the very high-performance inter-FPGA communication capabilities of these devices. The absolute floating-point performance of an FPGA is lower than that of other common accelerators such as GPUs. However, we consider that FPGAs can be applied to a wide variety of HPC applications if computation and communication can be combined on the FPGA. The purpose of this paper is to implement a parallel processing system that runs applications written with HLS and combines computation and communication in FPGAs. We propose the Channel over Ethernet (CoE) system, which connects multiple FPGAs directly for OpenCL parallel programming. "Channel" is one of the extensions provided by the Intel OpenCL environment. Channels are ordinarily used for communication between kernels inside a single FPGA, but we extend them to external communication through the CoE system. In this paper, we introduce two benchmarks as demonstrations of the CoE system. Using the pingpong benchmark, implemented on an Intel Arria 10 FPGA, we achieved a throughput of 29.77 Gbps (approximately 75% of the theoretical peak of 40 Gbps) and a latency of 950 ns. In addition, we evaluated the Himeno benchmark, a 3D Computational Fluid Dynamics (CFD) benchmark, on the system and achieved 23,689 MFLOPS with 4 FPGAs on a problem of size M. We also observe good strong scaling, with a 3.93 times speedup over a single-FPGA run on the same problem size.
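The reported figures can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the single-FPGA performance it prints is derived from the quoted 4-FPGA result and speedup, not a number stated in the abstract.

```python
# Figures quoted in the abstract.
MEASURED_GBPS = 29.77      # pingpong throughput
PEAK_GBPS = 40.0           # theoretical peak of the link
MFLOPS_4_FPGAS = 23689.0   # Himeno benchmark, size M, 4 FPGAs
SPEEDUP_4_FPGAS = 3.93     # speedup over a single-FPGA run
NUM_FPGAS = 4


def link_efficiency(measured_gbps: float, peak_gbps: float) -> float:
    """Fraction of the theoretical peak bandwidth actually achieved."""
    return measured_gbps / peak_gbps


def parallel_efficiency(speedup: float, num_devices: int) -> float:
    """Strong-scaling efficiency: speedup divided by device count."""
    return speedup / num_devices


def implied_single_device_mflops(total_mflops: float, speedup: float) -> float:
    """Single-device performance implied by the multi-device result.

    This is a derived estimate (an assumption of this sketch), not a
    value reported in the abstract.
    """
    return total_mflops / speedup


if __name__ == "__main__":
    print(f"link efficiency:     {link_efficiency(MEASURED_GBPS, PEAK_GBPS):.1%}")
    print(f"parallel efficiency: {parallel_efficiency(SPEEDUP_4_FPGAS, NUM_FPGAS):.1%}")
    print(f"implied 1-FPGA rate: "
          f"{implied_single_device_mflops(MFLOPS_4_FPGAS, SPEEDUP_4_FPGAS):.0f} MFLOPS")
```

A parallel efficiency close to 1.0 means that adding FPGAs costs little in communication overhead, which is the property the CoE system is meant to demonstrate.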