BurstZ+: Eliminating The Communication Bottleneck of Scientific Computing Accelerators via Accelerated Compression

ACM Transactions on Reconfigurable Technology and Systems (TRETS) Pub Date: 2022-01-31 DOI: 10.1145/3476831

Gongjin Sun, Seongyoung Kang, S. Jun

Citations: 5

Abstract

We present BurstZ+, an accelerator platform that eliminates the communication bottleneck between PCIe-attached scientific computing accelerators and their host servers via hardware-optimized compression. While accelerators such as GPUs and FPGAs provide enormous computing capabilities, their effectiveness quickly deteriorates once the data is larger than the accelerator's on-board memory capacity, and performance becomes limited by the communication bandwidth available for moving data between host memory and the accelerator. Compression has not been very useful in solving this issue because of the performance and efficiency problems of compressing floating-point numbers, of which scientific data often consists. BurstZ+ is an FPGA-based prototype accelerator platform that addresses the bandwidth issue via a novel class of hardware-optimized floating-point compression algorithms called ZFP-V. We demonstrate that BurstZ+ can completely remove the host-side communication bottleneck for accelerators, using multiple stencil kernels with a wide range of operational intensities. Evaluated against hand-optimized implementations of kernel accelerators of the same architecture, our single-pipeline BurstZ+ prototype outperforms an accelerator without compression by almost 4×, and even an accelerator with enough memory for the entire dataset by over 2×. Furthermore, the projected performance of BurstZ+ on a future, faster FPGA scales to almost 7× that of the same accelerator without compression, whose performance is still limited by the PCIe bandwidth.
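
The bandwidth argument in the abstract can be illustrated with a simple back-of-the-envelope model. The Python sketch below is not taken from the paper: the function names, the model itself, and the numbers in the usage note are assumptions made for illustration. It assumes that a PCIe-bound accelerator consumes uncompressed data at the lesser of its kernel throughput and the link bandwidth multiplied by the compression ratio, and that the on-accelerator decompressor keeps up with the link.

```python
def effective_throughput(compute_gbps, link_gbps, compression_ratio=1.0):
    """Rate (GB/s of uncompressed data) the accelerator can sustain.

    Hypothetical model: the host link carries compressed bytes, so it
    delivers link_gbps * compression_ratio of uncompressed data per second,
    while the kernel cannot consume more than compute_gbps.
    Assumes decompression itself is never the bottleneck.
    """
    if compute_gbps <= 0 or link_gbps <= 0 or compression_ratio <= 0:
        raise ValueError("all parameters must be positive")
    return min(compute_gbps, link_gbps * compression_ratio)


def speedup_from_compression(compute_gbps, link_gbps, compression_ratio):
    """Speedup of the compressed link over the same link without compression."""
    baseline = effective_throughput(compute_gbps, link_gbps)
    compressed = effective_throughput(compute_gbps, link_gbps, compression_ratio)
    return compressed / baseline
```

With illustrative values (a kernel that can consume 50 GB/s behind a 10 GB/s link), `speedup_from_compression(50.0, 10.0, 2.0)` gives 2.0 because the link is still the limit, while a ratio of 10 gives 5.0 because the kernel becomes the limit. Under this model the gain from compression is capped by the ratio of compute throughput to link bandwidth.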

Source journal

ACM Transactions on Reconfigurable Technology and Systems (TRETS)

Self-citation rate

0.00%

Publication count