ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2016-12-01 DOI:10.1145/3020078.3021745

Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, W. Dally

{"title":"ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA","authors":"Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, W. Dally","doi":"10.1145/3020078.3021745","DOIUrl":null,"url":null,"abstract":"Long Short-Term Memory (LSTM) is widely used in speech recognition. In order to achieve higher prediction accuracy, machine learning scientists have built increasingly larger models. Such large model is both computation intensive and memory intensive. Deploying such bulky model results in high power consumption and leads to a high total cost of ownership (TCO) of a data center. To speedup the prediction and make it energy efficient, we first propose a load-balance-aware pruning method that can compress the LSTM model size by 20x (10x from pruning and 2x from quantization) with negligible loss of the prediction accuracy. The pruned model is friendly for parallel processing. Next, we propose a scheduler that encodes and partitions the compressed model to multiple PEs for parallelism and schedule the complicated LSTM data flow. Finally, we design the hardware architecture, named Efficient Speech Recognition Engine (ESE) that works directly on the sparse LSTM model. Implemented on Xilinx KU060 FPGA running at 200MHz, ESE has a performance of 282 GOPS working directly on the sparse LSTM network, corresponding to 2.52 TOPS on the dense one, and processes a full LSTM for speech recognition with a power dissipation of 41 Watts. Evaluated on the LSTM for speech recognition benchmark, ESE is 43x and 3x faster than Core i7 5930k CPU and Pascal Titan X GPU implementations. It achieves 40x and 11.5x higher energy efficiency compared with the CPU and GPU respectively.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"33 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"570","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3020078.3021745","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 570

Abstract

Long Short-Term Memory (LSTM) is widely used in speech recognition. In order to achieve higher prediction accuracy, machine learning scientists have built increasingly larger models. Such large model is both computation intensive and memory intensive. Deploying such bulky model results in high power consumption and leads to a high total cost of ownership (TCO) of a data center. To speedup the prediction and make it energy efficient, we first propose a load-balance-aware pruning method that can compress the LSTM model size by 20x (10x from pruning and 2x from quantization) with negligible loss of the prediction accuracy. The pruned model is friendly for parallel processing. Next, we propose a scheduler that encodes and partitions the compressed model to multiple PEs for parallelism and schedule the complicated LSTM data flow. Finally, we design the hardware architecture, named Efficient Speech Recognition Engine (ESE) that works directly on the sparse LSTM model. Implemented on Xilinx KU060 FPGA running at 200MHz, ESE has a performance of 282 GOPS working directly on the sparse LSTM network, corresponding to 2.52 TOPS on the dense one, and processes a full LSTM for speech recognition with a power dissipation of 41 Watts. Evaluated on the LSTM for speech recognition benchmark, ESE is 43x and 3x faster than Core i7 5930k CPU and Pascal Titan X GPU implementations. It achieves 40x and 11.5x higher energy efficiency compared with the CPU and GPU respectively.

查看原文本刊更多论文

基于FPGA的高效稀疏LSTM语音识别引擎

长短期记忆(LSTM)在语音识别中有着广泛的应用。为了达到更高的预测精度，机器学习科学家已经建立了越来越大的模型。如此庞大的模型既需要大量的计算，又需要大量的内存。部署如此庞大的模型会导致高功耗，并导致数据中心的高总拥有成本(TCO)。为了加速预测并使其节能，我们首先提出了一种负载平衡感知的修剪方法，该方法可以将LSTM模型大小压缩20倍(修剪10倍，量化2倍)，而预测精度的损失可以忽略不计。剪枝模型便于并行处理。接下来，我们提出了一个调度器，该调度器将压缩模型编码并划分为多个pe以实现并行性，并调度复杂的LSTM数据流。最后，设计了直接作用于稀疏LSTM模型的高效语音识别引擎(ESE)硬件架构。ESE在运行频率为200MHz的Xilinx KU060 FPGA上实现，在稀疏LSTM网络上直接工作的性能为282 GOPS，在密集LSTM网络上对应2.52 TOPS，处理一个完整的LSTM用于语音识别，功耗为41瓦。在LSTM的语音识别基准测试中，ESE比Core i7 5930k CPU和Pascal Titan X GPU实现分别快43倍和3倍。与CPU和GPU相比，能效分别提高了40倍和11.5倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

自引率

0.00%

发文量