基于压缩深度神经网络的高效推理引擎

2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA) Pub Date : 2016-02-04 DOI:10.1145/3007787.3001163

Song Han, Xingyu Liu, Huizi Mao, Jing Pu, A. Pedram, M. Horowitz, W. Dally

{"title":"基于压缩深度神经网络的高效推理引擎","authors":"Song Han, Xingyu Liu, Huizi Mao, Jing Pu, A. Pedram, M. Horowitz, W. Dally","doi":"10.1145/3007787.3001163","DOIUrl":null,"url":null,"abstract":"State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120x energy saving, Exploiting sparsity saves 10x, Weight sharing gives 8x, Skipping zero activations from ReLU saves another 3x. Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes FC layers of AlexNet at 1.88x104 frames/sec with a power dissipation of only 600mW. It is 24,000x and 3,400x more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9x, 19x and 3x better throughput, energy efficiency and area efficiency.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"232 1","pages":"243-254"},"PeriodicalIF":0.0000,"publicationDate":"2016-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2224","resultStr":"{\"title\":\"EIE: Efficient Inference Engine on Compressed Deep Neural Network\",\"authors\":\"Song Han, Xingyu Liu, Huizi Mao, Jing Pu, A. Pedram, M. Horowitz, W. Dally\",\"doi\":\"10.1145/3007787.3001163\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120x energy saving, Exploiting sparsity saves 10x, Weight sharing gives 8x, Skipping zero activations from ReLU saves another 3x. Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes FC layers of AlexNet at 1.88x104 frames/sec with a power dissipation of only 600mW. It is 24,000x and 3,400x more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9x, 19x and 3x better throughput, energy efficiency and area efficiency.\",\"PeriodicalId\":6634,\"journal\":{\"name\":\"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)\",\"volume\":\"232 1\",\"pages\":\"243-254\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-02-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2224\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3007787.3001163\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3007787.3001163","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 2224

摘要

最先进的深度神经网络(dnn)有数亿个连接，并且计算和内存都很密集，这使得它们很难部署在硬件资源和功耗预算有限的嵌入式系统上。虽然定制硬件有助于计算，但从DRAM中获取权重比ALU操作要贵两个数量级，并且占据了所需的功率。先前提出的“深度压缩”使大型dnn (AlexNet和VGGNet)完全适合片上SRAM成为可能。这种压缩是通过修剪冗余连接和让多个连接共享相同的权重来实现的。我们提出了一种节能推理引擎(EIE)，该引擎在该压缩网络模型上执行推理，并通过权值共享加速得到的稀疏矩阵向量乘法。从DRAM到SRAM可以使eee节省120倍的能源，利用稀疏性节省10倍，重量共享节省8倍，从ReLU跳过零激活又节省3倍。在9个DNN基准测试中进行评估，与CPU和GPU实现相同的DNN相比，EIE在没有压缩的情况下分别快了189倍和13倍。EIE在压缩网络上直接工作的处理能力为102 GOPS，相当于未压缩网络上的3 TOPS，以1.88x104帧/秒的速度处理AlexNet的FC层，功耗仅为600mW。它的能效分别是CPU和GPU的24000倍和3400倍。与大电脑相比，EIE的吞吐量、能效和面积效率分别提高2.9倍、19倍和3倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

EIE: Efficient Inference Engine on Compressed Deep Neural Network

State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120x energy saving, Exploiting sparsity saves 10x, Weight sharing gives 8x, Skipping zero activations from ReLU saves another 3x. Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes FC layers of AlexNet at 1.88x104 frames/sec with a power dissipation of only 600mW. It is 24,000x and 3,400x more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9x, 19x and 3x better throughput, energy efficiency and area efficiency.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)

自引率

0.00%

发文量