{"title":"HPIPE NX: Boosting CNN Inference Acceleration Performance with AI-Optimized FPGAs","authors":"Marius Stan, Mathew Hall, M. Ibrahim, Vaughn Betz","doi":"10.1109/ICFPT56656.2022.9974441","DOIUrl":null,"url":null,"abstract":"With the ever-increasing compute demands of artificial intelligence (AI) workloads, there is extensive interest in leveraging field-programmable gate-arrays (FPGAs) to quickly deploy hardware accelerators for the latest convolutional neural networks (CNNs). Recent FPGA architectures are also evolving to better serve the needs of AI, but accelerators need extensive re-design to leverage these new features. The Stratix 10 NX chip by Intel is a new FPGA that replaces traditional DSP blocks with in-fabric AI tensor blocks that provide 15x more multipliers and up to 143 TOPS of performance, at the cost of lower precision (INT8) and significant restrictions on how many operands can be fed to the multipliers from the programmable routing. In this paper, we explore different CNN accelerator structures to leverage the tensor blocks, considering the various tensor block modes, operand bandwidth restrictions, and on-chip memory restrictions. We incorporate the most performant techniques into HPIPE, a layer-pipelined and sparse-aware CNN accelerator for FPGAs. We enhance HPIPE's software compiler to restructure the CNN computations and on-chip memory layout to take advantage of the additional multipliers offered by the new tensor block architecture, while also avoiding stalls due to data loading restrictions. We achieve cycle-by-cycle speedups in tensor mode of up to $\\mathbf{8}.\\mathbf{3}\\mathbf{x}$ for Mobilenet-v1 versus the original HPIPE design using conventional DSPs. On the FPGA, we achieve a throughput of 28,541 and 29,429 images/s on Mobilenet-v1 and Mobilenet-v2 respectively, outperforming all previous FPGA accelerators by at least 4.0x, including one on an AI-optimized Xilinx chip. We also outperform NVIDIA's V100 GPU, a machine learning targeted GPU on a similar process node with a $\\mathbf{1}.\\mathbf{7}\\mathbf{x}$ larger die size, by up to 17x with a batch size of one and 1.3x with NVIDIA's largest reported batch size of 128.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 International Conference on Field-Programmable Technology (ICFPT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICFPT56656.2022.9974441","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3
Abstract
With the ever-increasing compute demands of artificial intelligence (AI) workloads, there is extensive interest in leveraging field-programmable gate arrays (FPGAs) to quickly deploy hardware accelerators for the latest convolutional neural networks (CNNs). Recent FPGA architectures are also evolving to better serve the needs of AI, but accelerators need extensive redesign to leverage these new features. Intel's Stratix 10 NX is a new FPGA that replaces traditional DSP blocks with in-fabric AI tensor blocks, providing 15x more multipliers and up to 143 TOPS of performance at the cost of lower precision (INT8) and significant restrictions on how many operands can be fed to the multipliers from the programmable routing. In this paper, we explore different CNN accelerator structures that leverage the tensor blocks, considering the various tensor block modes, operand bandwidth restrictions, and on-chip memory restrictions. We incorporate the most performant techniques into HPIPE, a layer-pipelined and sparse-aware CNN accelerator for FPGAs. We enhance HPIPE's software compiler to restructure the CNN computations and on-chip memory layout to exploit the additional multipliers offered by the new tensor block architecture, while avoiding stalls due to data-loading restrictions. We achieve cycle-by-cycle speedups in tensor mode of up to 8.3x on Mobilenet-v1 versus the original HPIPE design using conventional DSPs. On the FPGA, we achieve throughputs of 28,541 and 29,429 images/s on Mobilenet-v1 and Mobilenet-v2 respectively, outperforming all previous FPGA accelerators by at least 4.0x, including one on an AI-optimized Xilinx chip. We also outperform NVIDIA's V100, a machine-learning-targeted GPU built on a similar process node with a 1.7x larger die, by up to 17x at a batch size of one and by 1.3x at NVIDIA's largest reported batch size of 128.
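To make the restructuring idea in the abstract concrete, the sketch below is a minimal NumPy model (not the HPIPE implementation) of how a convolution's per-output-pixel dot product can be packed onto fixed-width INT8 dot-product units like those in AI tensor blocks. The lane width `LANES`, the function names, and the zero-padding scheme are illustrative assumptions chosen for this example; the actual tensor block modes, operand-feeding restrictions, and HPIPE's compiler transformations are described in the paper itself.

```python
# Illustrative sketch only: models fixed-width INT8 dot-product units and
# how a longer convolution dot product is split across them.
import numpy as np

LANES = 10  # assumed INT8 multipliers per dot-product unit (illustrative, not the real hardware spec)

def tensor_unit_dot(a: np.ndarray, b: np.ndarray) -> int:
    """Model one fixed-width dot-product unit: exactly LANES INT8 products, wide accumulation."""
    assert a.shape == b.shape == (LANES,)
    return int(np.dot(a.astype(np.int32), b.astype(np.int32)))

def conv_pixel_dot(weights: np.ndarray, activations: np.ndarray) -> int:
    """Compute one output-pixel dot product by splitting it into LANES-wide chunks.
    The tail is zero-padded so every chunk fills a unit; padded lanes do no useful
    work, which is one reason a compiler reorders computations to keep lanes full."""
    n = weights.size
    pad = (-n) % LANES
    w = np.pad(weights.astype(np.int8), (0, pad))
    x = np.pad(activations.astype(np.int8), (0, pad))
    acc = 0
    for i in range(0, n + pad, LANES):
        acc += tensor_unit_dot(w[i:i + LANES], x[i:i + LANES])
    return acc

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.integers(-128, 128, size=27, dtype=np.int8)  # e.g. a 3x3 kernel over 3 input channels
    x = rng.integers(-128, 128, size=27, dtype=np.int8)
    # The chunked result must match a plain full-precision dot product.
    assert conv_pixel_dot(w, x) == int(np.dot(w.astype(np.int32), x.astype(np.int32)))
    print(conv_pixel_dot(w, x))
```

In this toy model, a 27-element dot product occupies three 10-lane chunks with three wasted lanes; keeping such waste low, and keeping operands flowing despite limited routing bandwidth into the units, is the kind of problem the paper's compiler restructuring addresses.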