二权和三权神经网络推理的高效核变换体系

2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC) Pub Date : 2018-06-01 DOI:10.1145/3195970.3195988

Shixuan Zheng, Yonggang Liu, S. Yin, Leibo Liu, Shaojun Wei

{"title":"二权和三权神经网络推理的高效核变换体系","authors":"Shixuan Zheng, Yonggang Liu, S. Yin, Leibo Liu, Shaojun Wei","doi":"10.1145/3195970.3195988","DOIUrl":null,"url":null,"abstract":"While deep convolutional neural networks (CNNs) have emerged as the driving force of a wide range of domains, their computationally and memory intensive natures hinder the further deployment in mobile and embedded applications. Recently, CNNs with low-precision parameters have attracted much research attention. Among them, multiplier-free binary- and ternary-weight CNNs are reported to be of comparable recognition accuracy with full-precision networks, and have been employed to improve the hardware efficiency. However, even with the weights constrained to binary and ternary values, large-scale CNNs still require billions of operations in a single forward propagation pass.In this paper, we introduce a novel approach to maximally eliminate redundancy in binary- and ternary-weight CNN inference, improving both the performance and energy efficiency. The initial kernels are transformed into much fewer and sparser ones, and the output feature maps are rebuilt from the immediate results. Overall, the number of total operations in convolution is reduced. To find an efficient transformation solution for each already trained network, we propose a searching algorithm, which iteratively matches and eliminates the overlap in a set of kernels. We design a specific hardware architecture to optimize the implementation of kernel transformation. Specialized dataflow and scheduling method are proposed. Tested on SVHN, AlexNet, and VGG-16, our architecture removes 43.4%–79.9% operations, and speeds up the inference by 1.48–3.01 times.","PeriodicalId":6491,"journal":{"name":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","volume":"2 1-2","pages":"1-6"},"PeriodicalIF":0.0000,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"An Efficient Kernel Transformation Architecture for Binary- and Ternary-Weight Neural Network Inference\",\"authors\":\"Shixuan Zheng, Yonggang Liu, S. Yin, Leibo Liu, Shaojun Wei\",\"doi\":\"10.1145/3195970.3195988\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"While deep convolutional neural networks (CNNs) have emerged as the driving force of a wide range of domains, their computationally and memory intensive natures hinder the further deployment in mobile and embedded applications. Recently, CNNs with low-precision parameters have attracted much research attention. Among them, multiplier-free binary- and ternary-weight CNNs are reported to be of comparable recognition accuracy with full-precision networks, and have been employed to improve the hardware efficiency. However, even with the weights constrained to binary and ternary values, large-scale CNNs still require billions of operations in a single forward propagation pass.In this paper, we introduce a novel approach to maximally eliminate redundancy in binary- and ternary-weight CNN inference, improving both the performance and energy efficiency. The initial kernels are transformed into much fewer and sparser ones, and the output feature maps are rebuilt from the immediate results. Overall, the number of total operations in convolution is reduced. To find an efficient transformation solution for each already trained network, we propose a searching algorithm, which iteratively matches and eliminates the overlap in a set of kernels. We design a specific hardware architecture to optimize the implementation of kernel transformation. Specialized dataflow and scheduling method are proposed. Tested on SVHN, AlexNet, and VGG-16, our architecture removes 43.4%–79.9% operations, and speeds up the inference by 1.48–3.01 times.\",\"PeriodicalId\":6491,\"journal\":{\"name\":\"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)\",\"volume\":\"2 1-2\",\"pages\":\"1-6\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3195970.3195988\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3195970.3195988","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 5

摘要

虽然深度卷积神经网络(cnn)已经成为广泛领域的驱动力，但其计算和内存密集型的特性阻碍了其在移动和嵌入式应用中的进一步部署。近年来，低精度参数cnn备受关注。其中，无乘法器二权和三权cnn的识别精度与全精度网络相当，并被用于提高硬件效率。然而，即使将权重限制为二进制和三元值，大规模cnn仍然需要在单个前向传播通道中进行数十亿次操作。在本文中，我们引入了一种新的方法来最大限度地消除二权和三权CNN推理中的冗余，从而提高了性能和能源效率。将初始核转换为更少更稀疏的核，并根据直接结果重建输出的特征映射。总的来说，卷积中的总操作次数减少了。为了找到每个已训练网络的有效转换解，我们提出了一种搜索算法，该算法迭代匹配并消除一组核中的重叠。我们设计了一个特定的硬件架构来优化内核转换的实现。提出了专门的数据流和调度方法。在SVHN, AlexNet和VGG-16上进行了测试，我们的架构消除了43.4%-79.9%的操作，并将推理速度提高了1.48-3.01倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

An Efficient Kernel Transformation Architecture for Binary- and Ternary-Weight Neural Network Inference

While deep convolutional neural networks (CNNs) have emerged as the driving force of a wide range of domains, their computationally and memory intensive natures hinder the further deployment in mobile and embedded applications. Recently, CNNs with low-precision parameters have attracted much research attention. Among them, multiplier-free binary- and ternary-weight CNNs are reported to be of comparable recognition accuracy with full-precision networks, and have been employed to improve the hardware efficiency. However, even with the weights constrained to binary and ternary values, large-scale CNNs still require billions of operations in a single forward propagation pass.In this paper, we introduce a novel approach to maximally eliminate redundancy in binary- and ternary-weight CNN inference, improving both the performance and energy efficiency. The initial kernels are transformed into much fewer and sparser ones, and the output feature maps are rebuilt from the immediate results. Overall, the number of total operations in convolution is reduced. To find an efficient transformation solution for each already trained network, we propose a searching algorithm, which iteratively matches and eliminates the overlap in a set of kernels. We design a specific hardware architecture to optimize the implementation of kernel transformation. Specialized dataflow and scheduling method are proposed. Tested on SVHN, AlexNet, and VGG-16, our architecture removes 43.4%–79.9% operations, and speeds up the inference by 1.48–3.01 times.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)

自引率

0.00%

发文量