Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication

Nicholas Contini, B. Ramesh, Kaushik Kandadi Suresh, Tu Tran, Benjamin Michalowicz, M. Abduljabbar, H. Subramoni, D. Panda
{"title":"Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication","authors":"Nicholas Contini, B. Ramesh, Kaushik Kandadi Suresh, Tu Tran, Benjamin Michalowicz, M. Abduljabbar, H. Subramoni, D. Panda","doi":"10.1145/3577193.3593720","DOIUrl":null,"url":null,"abstract":"Modern HPC faces new challenges with the slowing of Moore's Law and the end of Dennard Scaling. Traditional computing architectures can no longer be expected to drive today's HPC loads, as shown by the adoption of heterogeneous system design leveraging accelerators such as GPUs and TPUs. Recently, FPGAs have become viable candidates as HPC accelerators. These devices can accelerate workloads by replicating implemented compute units to enable task parallelism, overlapping computation between and within kernels to enable pipeline parallelism, and increasing data locality by sending data directly between compute units. While many solutions for inter-FPGA communication have been presented, these proposed designs generally rely on inter-FPGA networks, unique system setups, and/or the consumption of soft logic resources on the chip. In this paper, we propose an FPGA-aware MPI runtime that avoids such shortcomings. Our MPI implementation does not use any special system setup other than plugging FPGA accelerators into PCIe slots. All communication is orchestrated by the host, utilizing the PCIe interconnect and inter-host network to implement message passing. We propose advanced designs that address data movement challenges and reduce the need for explicit data movement between the device and host (staging) in FPGA applications. We achieve up to 50% reduction in latency for point-to-point transfers compared to application-level staging.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"188 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 37th International Conference on Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3577193.3593720","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Modern HPC faces new challenges with the slowing of Moore's Law and the end of Dennard Scaling. Traditional computing architectures can no longer be expected to drive today's HPC loads, as shown by the adoption of heterogeneous system design leveraging accelerators such as GPUs and TPUs. Recently, FPGAs have become viable candidates as HPC accelerators. These devices can accelerate workloads by replicating implemented compute units to enable task parallelism, overlapping computation between and within kernels to enable pipeline parallelism, and increasing data locality by sending data directly between compute units. While many solutions for inter-FPGA communication have been presented, these proposed designs generally rely on inter-FPGA networks, unique system setups, and/or the consumption of soft logic resources on the chip. In this paper, we propose an FPGA-aware MPI runtime that avoids such shortcomings. Our MPI implementation does not use any special system setup other than plugging FPGA accelerators into PCIe slots. All communication is orchestrated by the host, utilizing the PCIe interconnect and inter-host network to implement message passing. We propose advanced designs that address data movement challenges and reduce the need for explicit data movement between the device and host (staging) in FPGA applications. We achieve up to 50% reduction in latency for point-to-point transfers compared to application-level staging.
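The abstract contrasts host-orchestrated, FPGA-aware message passing with application-level staging. The sketch below illustrates that contrast, assuming an OpenCL-style host API for the FPGA and a hypothetical MPI runtime that accepts device buffers directly (analogous to CUDA-aware MPI). The function names `send_staged` and `send_fpga_aware` are illustrative only and do not come from the paper.

```c
/* Sketch: application-level staging (baseline) vs. a hypothetical
 * FPGA-aware send path. OpenCL is used only as a representative FPGA
 * host API; the paper's actual runtime interface is not shown here. */
#include <mpi.h>
#include <CL/cl.h>
#include <stdlib.h>

/* Baseline: explicit staging through a host buffer. */
static void send_staged(cl_command_queue q, cl_mem dev_buf,
                        size_t nbytes, int dest, int tag)
{
    void *host_buf = malloc(nbytes);
    /* 1. Copy device -> host over PCIe (blocking read). */
    clEnqueueReadBuffer(q, dev_buf, CL_TRUE, 0, nbytes,
                        host_buf, 0, NULL, NULL);
    /* 2. Hand the staged host buffer to MPI. */
    MPI_Send(host_buf, (int)nbytes, MPI_BYTE, dest, tag, MPI_COMM_WORLD);
    free(host_buf);
}

/* Hypothetical FPGA-aware path: the runtime detects that the pointer
 * refers to FPGA device memory and performs the PCIe transfer itself,
 * so the application omits the explicit staging step above. */
static void send_fpga_aware(void *dev_ptr, size_t nbytes, int dest, int tag)
{
    MPI_Send(dev_ptr, (int)nbytes, MPI_BYTE, dest, tag, MPI_COMM_WORLD);
}
```

Under this assumed interface, the application-visible change is that the device buffer is passed to MPI directly; the staging copy and its latency move into the runtime, which is where the paper's reported reduction of up to 50% in point-to-point latency is measured against application-level staging.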