Exploiting High-Bandwidth Memory for FPGA-Acceleration of Inference on Sum-Product Networks

2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). Pub Date: 2022-05-01. DOI: 10.1109/IPDPSW55747.2022.00028

Lukas Weber, John M. Wirth, Lukas Sommer, A. Koch



Abstract

As the memory wall becomes increasingly problematic in high-performance computing, there is a steady push to improve memory architectures, focusing mainly on higher bandwidth and lower latency. One result of this push is High-Bandwidth Memory (HBM), an alternative to the regular DRAM typically used on accelerator cards. This work adapts an existing accelerator architecture for inference on Sum-Product Networks (SPNs) to exploit the HBM present on more recent high-performance FPGA accelerator cards. The evaluation shows that HBM enables almost linear performance scaling, owing to the embarrassingly parallel nature of batch-wise SPN inference. It also shows that the only hindrance to this scaling is the limited bandwidth available for data transfers between host and FPGA. Even with this bottleneck, the HBM-based design outperforms the prior FPGA-based implementation by up to 1.50x (geometric mean 1.29x), and the CPU and GPU baselines by up to 2.4x (geometric mean 1.6x) and 8.4x (geometric mean 6.9x), respectively. Based on the evaluation, the scaling potential of HBM-based FPGA accelerators is explored to give an outlook on what is to come with future generations of PCIe-based interfaces.
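The scaling argument in the abstract can be illustrated with a minimal sketch. It is not the authors' performance model: it assumes only that batch-wise inference throughput is capped by the smaller of the aggregate compute rate and the host-to-FPGA transfer rate. All function names, parameter names, and numbers below are hypothetical and chosen for illustration. The second helper shows how a geometric-mean speedup, the summary statistic quoted in the abstract, is computed.

```python
import math


def batch_throughput(num_cores, samples_per_sec_per_core,
                     link_bytes_per_sec, bytes_per_sample):
    """Sustained samples/s under a simple two-stage bottleneck model.

    Assumption (not from the paper): compute scales linearly with the
    number of accelerator cores, and every sample must cross the host
    link exactly once.
    """
    compute_rate = num_cores * samples_per_sec_per_core
    transfer_rate = link_bytes_per_sec / bytes_per_sample
    return min(compute_rate, transfer_rate)


def geometric_mean(speedups):
    """Geometric mean of a non-empty list of positive speedup factors."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


if __name__ == "__main__":
    # Hypothetical values: 100 samples/s per core, a 1000 B/s host link,
    # and 2 B per sample, so the link saturates at 500 samples/s.
    for cores in (1, 2, 4, 8):
        print(cores, batch_throughput(cores, 100.0, 1000.0, 2.0))
    print(geometric_mean([2.0, 8.0]))
```

In this toy setting, throughput grows linearly with the core count until the link rate is reached and stays flat afterwards, which mirrors the qualitative claim that host-to-FPGA bandwidth is the limiting factor.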
