Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication

Nicholas Contini, B. Ramesh, Kaushik Kandadi Suresh, Tu Tran, Benjamin Michalowicz, M. Abduljabbar, H. Subramoni, D. Panda
{"title":"Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication","authors":"Nicholas Contini, B. Ramesh, Kaushik Kandadi Suresh, Tu Tran, Benjamin Michalowicz, M. Abduljabbar, H. Subramoni, D. Panda","doi":"10.1145/3577193.3593720","DOIUrl":null,"url":null,"abstract":"Modern HPC faces new challenges with the slowing of Moore's Law and the end of Dennard Scaling. Traditional computing architectures can no longer be expected to drive today's HPC loads, as shown by the adoption of heterogeneous system design leveraging accelerators such as GPUs and TPUs. Recently, FPGAs have become viable candidates as HPC accelerators. These devices can accelerate workloads by replicating implemented compute units to enable task parallelism, overlapping computation between and within kernels to enable pipeline parallelism, and increasing data locality by sending data directly between compute units. While many solutions for inter-FPGA communication have been presented, these proposed designs generally rely on inter-FPGA networks, unique system setups, and/or the consumption of soft logic resources on the chip. In this paper, we propose an FPGA-aware MPI runtime that avoids such shortcomings. Our MPI implementation does not use any special system setup other than plugging FPGA accelerators into PCIe slots. All communication is orchestrated by the host, utilizing the PCIe interconnect and inter-host network to implement message passing. We propose advanced designs that address data movement challenges and reduce the need for explicit data movement between the device and host (staging) in FPGA applications. We achieve up to 50% reduction in latency for point-to-point transfers compared to application-level staging.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"188 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 37th International Conference on Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3577193.3593720","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Modern HPC faces new challenges with the slowing of Moore's Law and the end of Dennard Scaling. Traditional computing architectures can no longer be expected to drive today's HPC loads, as shown by the adoption of heterogeneous system design leveraging accelerators such as GPUs and TPUs. Recently, FPGAs have become viable candidates as HPC accelerators. These devices can accelerate workloads by replicating implemented compute units to enable task parallelism, overlapping computation between and within kernels to enable pipeline parallelism, and increasing data locality by sending data directly between compute units. While many solutions for inter-FPGA communication have been presented, these proposed designs generally rely on inter-FPGA networks, unique system setups, and/or the consumption of soft logic resources on the chip. In this paper, we propose an FPGA-aware MPI runtime that avoids such shortcomings. Our MPI implementation does not use any special system setup other than plugging FPGA accelerators into PCIe slots. All communication is orchestrated by the host, utilizing the PCIe interconnect and inter-host network to implement message passing. We propose advanced designs that address data movement challenges and reduce the need for explicit data movement between the device and host (staging) in FPGA applications. We achieve up to 50% reduction in latency for point-to-point transfers compared to application-level staging.
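The abstract contrasts host-orchestrated, FPGA-aware message passing with application-level staging. The sketch below illustrates that contrast, assuming an OpenCL-style host API for the FPGA and a hypothetical MPI runtime that accepts device buffers directly (analogous to CUDA-aware MPI). The function names `send_staged` and `send_fpga_aware` are illustrative only and do not come from the paper.

```c
/* Sketch: application-level staging (baseline) vs. a hypothetical
 * FPGA-aware send path. OpenCL is used only as a representative FPGA
 * host API; the paper's actual runtime interface is not shown here. */
#include <mpi.h>
#include <CL/cl.h>
#include <stdlib.h>

/* Baseline: explicit staging through a host buffer. */
static void send_staged(cl_command_queue q, cl_mem dev_buf,
                        size_t nbytes, int dest, int tag)
{
    void *host_buf = malloc(nbytes);
    /* 1. Copy device -> host over PCIe (blocking read). */
    clEnqueueReadBuffer(q, dev_buf, CL_TRUE, 0, nbytes,
                        host_buf, 0, NULL, NULL);
    /* 2. Hand the staged host buffer to MPI. */
    MPI_Send(host_buf, (int)nbytes, MPI_BYTE, dest, tag, MPI_COMM_WORLD);
    free(host_buf);
}

/* Hypothetical FPGA-aware path: the runtime detects that the pointer
 * refers to FPGA device memory and performs the PCIe transfer itself,
 * so the application omits the explicit staging step above. */
static void send_fpga_aware(void *dev_ptr, size_t nbytes, int dest, int tag)
{
    MPI_Send(dev_ptr, (int)nbytes, MPI_BYTE, dest, tag, MPI_COMM_WORLD);
}
```

Under this assumed interface, the application-visible change is that the device buffer is passed to MPI directly; the staging copy and its latency move into the runtime, which is where the paper's reported reduction of up to 50% in point-to-point latency is measured against application-level staging.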