Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing

Jorge Albericio, Patrick Judd, Tayler H. Hetherington, Tor M. Aamodt, Natalie D. Enright Jerger, Andreas Moshovos
{"title":"Cnvlutin:无效-无神经元的深度神经网络计算","authors":"Jorge Albericio, Patrick Judd, Tayler H. Hetherington, Tor M. Aamodt, Natalie D. Enright Jerger, Andreas Moshovos","doi":"10.1145/3007787.3001138","DOIUrl":null,"url":null,"abstract":"This work observes that a large fraction of the computations performed by Deep Neural Networks (DNNs) are intrinsically ineffectual as they involve a multiplication where one of the inputs is zero. This observation motivates Cnvolutin (CNV), a value-based approach to hardware acceleration that eliminates most of these ineffectual operations, improving performance and energy over a state-of-the-art accelerator with no accuracy loss. CNV uses hierarchical data-parallel units, allowing groups of lanes to proceed mostly independently enabling them to skip over the ineffectual computations. A co-designed data storage format encodes the computation elimination decisions taking them off the critical path while avoiding control divergence in the data parallel units. Combined, the units and the data storage format result in a data-parallel architecture that maintains wide, aligned accesses to its memory hierarchy and that keeps its data lanes busy. By loosening the ineffectual computation identification criterion, CNV enables further performance and energy efficiency improvements, and more so if a loss in accuracy is acceptable. Experimental measurements over a set of state-of-the-art DNNs for image classification show that CNV improves performance over a state-of-the-art accelerator from 1.24× to 1.55× and by 1.37× on average without any loss in accuracy by removing zero-valued operand multiplications alone. While CNV incurs an area overhead of 4.49%, it improves overall EDP (Energy Delay Product) and ED2P (Energy Delay Squared Product) on average by 1.47× and 2.01×, respectively. The average performance improvements increase to 1.52× without any loss in accuracy with a broader ineffectual identification policy. Further improvements are demonstrated with a loss in accuracy.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"50 1","pages":"1-13"},"PeriodicalIF":0.0000,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"629","resultStr":"{\"title\":\"Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing\",\"authors\":\"Jorge Albericio, Patrick Judd, Tayler H. Hetherington, Tor M. Aamodt, Natalie D. Enright Jerger, Andreas Moshovos\",\"doi\":\"10.1145/3007787.3001138\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This work observes that a large fraction of the computations performed by Deep Neural Networks (DNNs) are intrinsically ineffectual as they involve a multiplication where one of the inputs is zero. This observation motivates Cnvolutin (CNV), a value-based approach to hardware acceleration that eliminates most of these ineffectual operations, improving performance and energy over a state-of-the-art accelerator with no accuracy loss. CNV uses hierarchical data-parallel units, allowing groups of lanes to proceed mostly independently enabling them to skip over the ineffectual computations. A co-designed data storage format encodes the computation elimination decisions taking them off the critical path while avoiding control divergence in the data parallel units. 
Combined, the units and the data storage format result in a data-parallel architecture that maintains wide, aligned accesses to its memory hierarchy and that keeps its data lanes busy. By loosening the ineffectual computation identification criterion, CNV enables further performance and energy efficiency improvements, and more so if a loss in accuracy is acceptable. Experimental measurements over a set of state-of-the-art DNNs for image classification show that CNV improves performance over a state-of-the-art accelerator from 1.24× to 1.55× and by 1.37× on average without any loss in accuracy by removing zero-valued operand multiplications alone. While CNV incurs an area overhead of 4.49%, it improves overall EDP (Energy Delay Product) and ED2P (Energy Delay Squared Product) on average by 1.47× and 2.01×, respectively. The average performance improvements increase to 1.52× without any loss in accuracy with a broader ineffectual identification policy. Further improvements are demonstrated with a loss in accuracy.\",\"PeriodicalId\":6634,\"journal\":{\"name\":\"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)\",\"volume\":\"50 1\",\"pages\":\"1-13\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-06-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"629\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3007787.3001138\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3007787.3001138","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 629

Abstract

This work observes that a large fraction of the computations performed by Deep Neural Networks (DNNs) are intrinsically ineffectual as they involve a multiplication where one of the inputs is zero. This observation motivates Cnvlutin (CNV), a value-based approach to hardware acceleration that eliminates most of these ineffectual operations, improving performance and energy over a state-of-the-art accelerator with no accuracy loss. CNV uses hierarchical data-parallel units, allowing groups of lanes to proceed mostly independently, enabling them to skip over the ineffectual computations. A co-designed data storage format encodes the computation-elimination decisions, taking them off the critical path while avoiding control divergence in the data-parallel units. Combined, the units and the data storage format result in a data-parallel architecture that maintains wide, aligned accesses to its memory hierarchy and that keeps its data lanes busy. By loosening the ineffectual-computation identification criterion, CNV enables further performance and energy-efficiency improvements, and more so if a loss in accuracy is acceptable. Experimental measurements over a set of state-of-the-art DNNs for image classification show that, by removing zero-valued operand multiplications alone, CNV improves performance over a state-of-the-art accelerator by 1.24× to 1.55×, and by 1.37× on average, without any loss in accuracy. While CNV incurs an area overhead of 4.49%, it improves overall EDP (Energy Delay Product) and ED2P (Energy Delay Squared Product) on average by 1.47× and 2.01×, respectively. With a broader ineffectual-identification policy, the average performance improvement increases to 1.52×, still without any loss in accuracy. Further improvements are demonstrated when a loss in accuracy is acceptable.
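To make the core observation concrete, here is a minimal software analogue of the zero-skipping idea; it is a sketch, not the hardware datapath, and the array sizes and names are illustrative. ReLU layers leave many activations at exactly zero, so every multiply that touches one is ineffectual and can be skipped without changing the result.

```python
import numpy as np

def dense_dot(weights, activations):
    """Baseline: multiply-accumulate over every (weight, activation) pair,
    including the ineffectual ones where the activation is zero."""
    acc = 0.0
    for w, a in zip(weights, activations):
        acc += w * a          # wasted work whenever a == 0
    return acc

def zero_skipping_dot(weights, activations):
    """Value-based skipping: only nonzero activations are visited,
    so the multiplier never sees a zero operand."""
    acc = 0.0
    for i, a in enumerate(activations):
        if a != 0.0:          # the elimination decision
            acc += weights[i] * a
    return acc

# Post-ReLU activations are frequently exactly zero, which is what CNV exploits.
acts = np.maximum(np.random.randn(16), 0.0)
wts = np.random.randn(16)
assert np.isclose(dense_dot(wts, acts), zero_skipping_dot(wts, acts))
```

In hardware, the `if a != 0.0` test cannot sit in the inner loop of a wide data-parallel unit without causing divergence, which is why CNV moves the decision into the storage format instead.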
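The "co-designed data storage format" records the elimination decisions ahead of time so the lanes never test for zeros at run time; in the paper this is the Zero-Free Neuron Array format (ZFNAf). The sketch below conveys the spirit of such an offset-based encoding; the brick size, function names, and pair layout are illustrative assumptions rather than the paper's exact format.

```python
from typing import List, Tuple

def encode_zero_free(brick: List[float]) -> List[Tuple[float, int]]:
    """Pack a fixed-size group ("brick") of activations into (value, offset)
    pairs, keeping only the nonzero entries. The offsets let the consumer
    fetch the matching weights without any runtime zero test."""
    return [(v, i) for i, v in enumerate(brick) if v != 0.0]

def sparse_dot(encoded: List[Tuple[float, int]], weights: List[float]) -> float:
    """Consume the encoded brick: every multiply is effectual by construction,
    so the data lanes stay busy on real work."""
    return sum(v * weights[off] for v, off in encoded)

brick = [0.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.25]
weights = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
enc = encode_zero_free(brick)            # [(1.5, 1), (2.0, 4), (0.25, 7)]
assert sparse_dot(enc, weights) == sum(a * w for a, w in zip(brick, weights))
```

Because the encoding is produced when a layer's output is written back, the skip decisions are off the critical path of the consuming layer, matching the abstract's claim.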
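For reference, the efficiency metrics quoted in the abstract combine energy E and delay D as below; these are the standard definitions, and an improvement of k× means the baseline-to-CNV ratio of the metric is k. The specific figures (1.47× and 2.01×) come from the paper's measurements.

```latex
\mathrm{EDP} = E \cdot D, \qquad
\mathrm{ED^2P} = E \cdot D^{2}, \qquad
\text{improvement} = \frac{\mathrm{EDP}_{\text{baseline}}}{\mathrm{EDP}_{\text{CNV}}}
```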