{"title":"HPIPE NX: Boosting CNN Inference Acceleration Performance with AI-Optimized FPGAs","authors":"Marius Stan, Mathew Hall, M. Ibrahim, Vaughn Betz","doi":"10.1109/ICFPT56656.2022.9974441","DOIUrl":null,"url":null,"abstract":"With the ever-increasing compute demands of artificial intelligence (AI) workloads, there is extensive interest in leveraging field-programmable gate-arrays (FPGAs) to quickly deploy hardware accelerators for the latest convolutional neural networks (CNNs). Recent FPGA architectures are also evolving to better serve the needs of AI, but accelerators need extensive re-design to leverage these new features. The Stratix 10 NX chip by Intel is a new FPGA that replaces traditional DSP blocks with in-fabric AI tensor blocks that provide 15x more multipliers and up to 143 TOPS of performance, at the cost of lower precision (INT8) and significant restrictions on how many operands can be fed to the multipliers from the programmable routing. In this paper, we explore different CNN accelerator structures to leverage the tensor blocks, considering the various tensor block modes, operand bandwidth restrictions, and on-chip memory restrictions. We incorporate the most performant techniques into HPIPE, a layer-pipelined and sparse-aware CNN accelerator for FPGAs. We enhance HPIPE's software compiler to restructure the CNN computations and on-chip memory layout to take advantage of the additional multipliers offered by the new tensor block architecture, while also avoiding stalls due to data loading restrictions. We achieve cycle-by-cycle speedups in tensor mode of up to $\\mathbf{8}.\\mathbf{3}\\mathbf{x}$ for Mobilenet-v1 versus the original HPIPE design using conventional DSPs. On the FPGA, we achieve a throughput of 28,541 and 29,429 images/s on Mobilenet-v1 and Mobilenet-v2 respectively, outperforming all previous FPGA accelerators by at least 4.0x, including one on an AI-optimized Xilinx chip. We also outperform NVIDIA's V100 GPU, a machine learning targeted GPU on a similar process node with a $\\mathbf{1}.\\mathbf{7}\\mathbf{x}$ larger die size, by up to 17x with a batch size of one and 1.3x with NVIDIA's largest reported batch size of 128.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 International Conference on Field-Programmable Technology (ICFPT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICFPT56656.2022.9974441","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3
Abstract
With the ever-increasing compute demands of artificial intelligence (AI) workloads, there is extensive interest in leveraging field-programmable gate arrays (FPGAs) to quickly deploy hardware accelerators for the latest convolutional neural networks (CNNs). Recent FPGA architectures are also evolving to better serve the needs of AI, but accelerators need extensive redesign to leverage these new features. Intel's Stratix 10 NX is a new FPGA that replaces traditional DSP blocks with in-fabric AI tensor blocks, providing 15x more multipliers and up to 143 TOPS of performance at the cost of lower precision (INT8) and significant restrictions on how many operands can be fed to the multipliers from the programmable routing. In this paper, we explore different CNN accelerator structures that leverage the tensor blocks, considering the various tensor block modes, operand bandwidth restrictions, and on-chip memory restrictions. We incorporate the most performant techniques into HPIPE, a layer-pipelined and sparse-aware CNN accelerator for FPGAs. We enhance HPIPE's software compiler to restructure the CNN computations and on-chip memory layout to exploit the additional multipliers offered by the new tensor block architecture, while avoiding stalls due to data-loading restrictions. We achieve cycle-by-cycle speedups in tensor mode of up to 8.3x on Mobilenet-v1 versus the original HPIPE design using conventional DSPs. On the FPGA, we achieve throughputs of 28,541 and 29,429 images/s on Mobilenet-v1 and Mobilenet-v2 respectively, outperforming all previous FPGA accelerators by at least 4.0x, including one on an AI-optimized Xilinx chip. We also outperform NVIDIA's V100, a machine-learning-targeted GPU built on a similar process node with a 1.7x larger die, by up to 17x at a batch size of one and by 1.3x at NVIDIA's largest reported batch size of 128.
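To make the restructuring idea in the abstract concrete, the sketch below is a minimal NumPy model (not the HPIPE implementation) of how a convolution's per-output-pixel dot product can be packed onto fixed-width INT8 dot-product units like those in AI tensor blocks. The lane width `LANES`, the function names, and the zero-padding scheme are illustrative assumptions chosen for this example; the actual tensor block modes, operand-feeding restrictions, and HPIPE's compiler transformations are described in the paper itself.

```python
# Illustrative sketch only: models fixed-width INT8 dot-product units and
# how a longer convolution dot product is split across them.
import numpy as np

LANES = 10  # assumed INT8 multipliers per dot-product unit (illustrative, not the real hardware spec)

def tensor_unit_dot(a: np.ndarray, b: np.ndarray) -> int:
    """Model one fixed-width dot-product unit: exactly LANES INT8 products, wide accumulation."""
    assert a.shape == b.shape == (LANES,)
    return int(np.dot(a.astype(np.int32), b.astype(np.int32)))

def conv_pixel_dot(weights: np.ndarray, activations: np.ndarray) -> int:
    """Compute one output-pixel dot product by splitting it into LANES-wide chunks.
    The tail is zero-padded so every chunk fills a unit; padded lanes do no useful
    work, which is one reason a compiler reorders computations to keep lanes full."""
    n = weights.size
    pad = (-n) % LANES
    w = np.pad(weights.astype(np.int8), (0, pad))
    x = np.pad(activations.astype(np.int8), (0, pad))
    acc = 0
    for i in range(0, n + pad, LANES):
        acc += tensor_unit_dot(w[i:i + LANES], x[i:i + LANES])
    return acc

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.integers(-128, 128, size=27, dtype=np.int8)  # e.g. a 3x3 kernel over 3 input channels
    x = rng.integers(-128, 128, size=27, dtype=np.int8)
    # The chunked result must match a plain full-precision dot product.
    assert conv_pixel_dot(w, x) == int(np.dot(w.astype(np.int32), x.astype(np.int32)))
    print(conv_pixel_dot(w, x))
```

In this toy model, a 27-element dot product occupies three 10-lane chunks with three wasted lanes; keeping such waste low, and keeping operands flowing despite limited routing bandwidth into the units, is the kind of problem the paper's compiler restructuring addresses.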