Fast HBM Access with FPGAs: Analysis, Architectures, and Applications

2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). Pub Date: 2021-06-01. DOI: 10.1109/IPDPSW52791.2021.00030

Philipp Holzinger, Daniel Reiser, Tobias Hahn, M. Reichenbach


Citations: 6

Abstract

Over the past few decades, the gap between rapidly increasing computational power and almost stagnating memory bandwidth has steadily widened. Recently, 3D die-stacking in the form of High Bandwidth Memory (HBM) enabled the first major jump in external memory throughput in years. In contrast to traditional DRAM, it compensates for its lower clock frequency with wide buses and a large number of separate channels. However, reaching the full throughput requires data to be spread across all channels. Previous research relied on manual HBM data partitioning schemes and handled each channel as an entirely independent entity. This paper, in contrast, also considers scalable hardware adaptations and approaches system design holistically. We first analyze the problem with real-world measurements on a Xilinx HBM FPGA. We then derive several architectural changes that improve throughput and ease accelerator design. Finally, we present a Roofline-based model that estimates the expected performance more accurately in advance. With these measures, we increased throughput by up to 3.78× for random and 40.6× for certain strided access patterns compared to Xilinx's state-of-the-art switch fabric.
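The abstract's point that data must be spread over all channels can be illustrated with a minimal sketch. It is not the paper's architecture or its refined model. The channel count (32), the 256-byte interleaving granule, the 256 MiB per-channel region, and all function names are assumptions chosen for illustration, and `roofline` is only the classic Roofline bound.

```python
# Illustrative sketch only: all constants and names below are assumptions,
# not values or interfaces taken from the paper.

def interleaved_channel(addr, num_channels=32, granule=256):
    """Map a byte address to a channel, rotating channels every `granule` bytes."""
    return (addr // granule) % num_channels


def contiguous_channel(addr, num_channels=32, channel_bytes=256 * 1024 * 1024):
    """Map a byte address to a channel when each channel owns one contiguous region."""
    return (addr // channel_bytes) % num_channels


def channels_used(stride, count, mapper):
    """Count the distinct channels hit by `count` accesses at a fixed byte stride."""
    return len({mapper(i * stride) for i in range(count)})


def roofline(peak_ops_per_s, bytes_per_s, ops_per_byte):
    """Classic Roofline bound: limited by compute or by memory traffic."""
    return min(peak_ops_per_s, bytes_per_s * ops_per_byte)


if __name__ == "__main__":
    # A 256-byte stride touches every channel under fine-grained interleaving...
    print(channels_used(256, 1024, interleaved_channel))
    # ...but a stride equal to granule * num_channels always lands on one channel.
    print(channels_used(256 * 32, 1024, interleaved_channel))
    # With contiguous per-channel regions, a small working set stays in one channel.
    print(channels_used(256, 1024, contiguous_channel))
```

Under these assumptions, the same strided pattern can use all channels or only one depending on the address mapping, which is why the bandwidth term fed into a Roofline-style estimate depends on the access pattern and not only on the peak figure.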
