A High Energy-Efficiency FPGA-Based LSTM Accelerator Architecture Design by Structured Pruning and Normalized Linear Quantization

Yong Zheng, Haigang Yang, Zhihong Huang, Tianli Li, Yiping Jia
{"title":"A High Energy-Efficiency FPGA-Based LSTM Accelerator Architecture Design by Structured Pruning and Normalized Linear Quantization","authors":"Yong Zheng, Haigang Yang, Zhihong Huang, Tianli Li, Yiping Jia","doi":"10.1109/ICFPT47387.2019.00045","DOIUrl":null,"url":null,"abstract":"LSTM (Long Short-Term Memory) is an artificial recurrent neural network (RNN) architecture and has been successfully applied to the areas where sequences of data need to be dealt with such as Natural Language Processing (NLP), speech recognition, etc. In this work, we explore an avenue to minimization of the LSTM inference part design based on FPGA for high performance and energy-efficiency. First, the model is pruned to create structured sparse features for the hardware-friendly purpose by using permuted block diagonal mask matrices. To further compress the model, we quantize the weights and activations following a normalized linear quantization approach. As a result, computational activities of the network are significantly deducted with an egligible loss on accuracy. Then a hardware architecture design has been devised to fully exploit the benefits of regular sparse structure. Having been implemented on Arria 10 (10AX115U4F45I3SG) FPGA running at 150 MHz, our accelerator demonstrates a peak performance of 2.22 TOPS at a power dissipation of 1.679 Watts. In comparison to the other FPGA-based LSTM accelerator designs previously reported, our approach achieves a 1.17-2.16x speedup in processing.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"122 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 International Conference on Field-Programmable Technology (ICFPT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICFPT47387.2019.00045","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

LSTM (Long Short-Term Memory) is a recurrent neural network (RNN) architecture that has been successfully applied in areas where sequences of data must be processed, such as Natural Language Processing (NLP) and speech recognition. In this work, we explore an FPGA-based design of the LSTM inference stage targeting high performance and energy efficiency. First, the model is pruned with permuted block-diagonal mask matrices to create a structured, hardware-friendly sparsity pattern. To further compress the model, we quantize the weights and activations using a normalized linear quantization approach. As a result, the computational workload of the network is significantly reduced with a negligible loss in accuracy. A hardware architecture is then devised to fully exploit the benefits of the regular sparse structure. Implemented on an Arria 10 (10AX115U4F45I3SG) FPGA running at 150 MHz, our accelerator demonstrates a peak performance of 2.22 TOPS at a power dissipation of 1.679 W. Compared with previously reported FPGA-based LSTM accelerator designs, our approach achieves a 1.17-2.16x speedup in processing.
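The paper itself provides no code; the sketch below is only an illustrative NumPy rendering of the two compression ideas named in the abstract: a permuted block-diagonal pruning mask and a normalized linear quantizer. The block count, bit width, matrix sizes, and function names are assumptions for demonstration, not values or APIs from the paper.

```python
# Minimal sketch (not the authors' code) of permuted block-diagonal pruning
# and normalized linear quantization. All parameters below are illustrative.
import numpy as np

def permuted_block_diagonal_mask(rows, cols, p, perm=None, rng=None):
    """Partition the matrix into a p x p grid of blocks and keep exactly one
    block per block-row, chosen by a permutation of the p block columns.
    All other blocks are zeroed, giving a structured density of 1/p."""
    assert rows % p == 0 and cols % p == 0
    br, bc = rows // p, cols // p              # block height and width
    if perm is None:
        rng = rng or np.random.default_rng(0)
        perm = rng.permutation(p)              # block column kept by each block row
    mask = np.zeros((rows, cols), dtype=np.float32)
    for i in range(p):
        j = perm[i]
        mask[i * br:(i + 1) * br, j * bc:(j + 1) * bc] = 1.0
    return mask

def normalized_linear_quantize(x, bits=8):
    """Normalize x by its maximum magnitude, then quantize uniformly onto a
    signed integer grid; returns the integer codes and the step size."""
    scale = np.max(np.abs(x)) + 1e-12
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale * qmax), -qmax, qmax)
    return q.astype(np.int8), scale / qmax

# Example: prune and quantize one (hypothetical) LSTM gate weight matrix.
W = np.random.randn(256, 512).astype(np.float32)
mask = permuted_block_diagonal_mask(*W.shape, p=4)   # 75% structured sparsity
W_pruned = W * mask
codes, step = normalized_linear_quantize(W_pruned, bits=8)
W_dequant = codes.astype(np.float32) * step          # values the datapath would see
```

Because every block row keeps the same number of nonzero blocks, the resulting sparsity is regular, which is what lets a hardware datapath skip the zeroed blocks without load-imbalance; the quantizer similarly maps weights and activations to narrow integers so the multipliers can be small. This is only a software-level illustration of those ideas, not the accelerator design.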