基于软件的神经网络变精度乘法

2020 IEEE High Performance Extreme Computing Conference (HPEC) Pub Date : 2020-09-22 DOI:10.1109/HPEC43674.2020.9286170

Richa Singh, Thomas Conroy, P. Schaumont

{"title":"基于软件的神经网络变精度乘法","authors":"Richa Singh, Thomas Conroy, P. Schaumont","doi":"10.1109/HPEC43674.2020.9286170","DOIUrl":null,"url":null,"abstract":"As the number of applications of neural networks continues to grow, so does the need to efficiently perform inference computations on highly constrained devices. In this paper, we propose a methodology to accelerate neural networks in software. We exploit the limited-precision requirements of typical neural networks by formulating recurring operations in a bit-slice computation format. Bit-slice computation ensures that every bit of an $M$-bit processor word contributes useful work even while computing a limited-precision n-bit (with $n$ < M) operation. This paper brings the following contributions. We first present an environment to efficiently create bitslice descriptions in software, by synthesizing them from Verilog. We then develop bitsliced designs of matrix multiplication and evaluate their performance. Our target is a small microcontroller, and we rely solely on software optimization. Our driving application is a neural network classifier for the MNIST database. Range-Based Linear Quantization in symmetric mode quantizes pre-trained 32-bit floating point weights and activation to low-precision data-widths. Experiments on RISC-V with varying levels of hardware-support show that for data-widths common to neural network applications, the bit-sliced code produces a speedup over traditional methods, which leads to faster and efficient inference without incurring significant loss in accuracy. For example, 8-bit matrix multiplications are sped up by a factor of 2.62× when compared with non-bitsliced rv32i ISA implementation with no hardware multiplier.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Variable Precision Multiplication for Software-Based Neural Networks\",\"authors\":\"Richa Singh, Thomas Conroy, P. Schaumont\",\"doi\":\"10.1109/HPEC43674.2020.9286170\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As the number of applications of neural networks continues to grow, so does the need to efficiently perform inference computations on highly constrained devices. In this paper, we propose a methodology to accelerate neural networks in software. We exploit the limited-precision requirements of typical neural networks by formulating recurring operations in a bit-slice computation format. Bit-slice computation ensures that every bit of an $M$-bit processor word contributes useful work even while computing a limited-precision n-bit (with $n$ < M) operation. This paper brings the following contributions. We first present an environment to efficiently create bitslice descriptions in software, by synthesizing them from Verilog. We then develop bitsliced designs of matrix multiplication and evaluate their performance. Our target is a small microcontroller, and we rely solely on software optimization. Our driving application is a neural network classifier for the MNIST database. Range-Based Linear Quantization in symmetric mode quantizes pre-trained 32-bit floating point weights and activation to low-precision data-widths. Experiments on RISC-V with varying levels of hardware-support show that for data-widths common to neural network applications, the bit-sliced code produces a speedup over traditional methods, which leads to faster and efficient inference without incurring significant loss in accuracy. For example, 8-bit matrix multiplications are sped up by a factor of 2.62× when compared with non-bitsliced rv32i ISA implementation with no hardware multiplier.\",\"PeriodicalId\":168544,\"journal\":{\"name\":\"2020 IEEE High Performance Extreme Computing Conference (HPEC)\",\"volume\":\"55 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-09-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE High Performance Extreme Computing Conference (HPEC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HPEC43674.2020.9286170\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPEC43674.2020.9286170","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 3

摘要

随着神经网络应用数量的不断增长，在高度受限的设备上高效执行推理计算的需求也在不断增长。在本文中，我们提出了一种在软件中加速神经网络的方法。我们利用典型神经网络的有限精度要求，在位片计算格式中制定循环运算。位片计算确保$M$位处理器字的每一位都贡献有用的工作，即使在计算精度有限的n位(使用$n$ < M)操作时也是如此。本文带来了以下贡献。我们首先提出了一个在软件中有效地创建位片描述的环境，通过从Verilog中合成它们。然后，我们开发了矩阵乘法的位切片设计并评估了它们的性能。我们的目标是一个小型微控制器，我们完全依赖于软件优化。我们的驱动应用是MNIST数据库的神经网络分类器。基于范围的线性量化在对称模式量化预训练的32位浮点权重和激活到低精度的数据宽度。在不同硬件支持级别的RISC-V上进行的实验表明，对于神经网络应用中常见的数据宽度，比特切片代码比传统方法产生了加速，从而导致更快，更有效的推理，而不会导致准确性的显着损失。例如，与没有硬件乘法器的非位切片rv32i ISA实现相比，8位矩阵乘法的速度提高了2.62倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Variable Precision Multiplication for Software-Based Neural Networks

As the number of applications of neural networks continues to grow, so does the need to efficiently perform inference computations on highly constrained devices. In this paper, we propose a methodology to accelerate neural networks in software. We exploit the limited-precision requirements of typical neural networks by formulating recurring operations in a bit-slice computation format. Bit-slice computation ensures that every bit of an $M$-bit processor word contributes useful work even while computing a limited-precision n-bit (with $n$ < M) operation. This paper brings the following contributions. We first present an environment to efficiently create bitslice descriptions in software, by synthesizing them from Verilog. We then develop bitsliced designs of matrix multiplication and evaluate their performance. Our target is a small microcontroller, and we rely solely on software optimization. Our driving application is a neural network classifier for the MNIST database. Range-Based Linear Quantization in symmetric mode quantizes pre-trained 32-bit floating point weights and activation to low-precision data-widths. Experiments on RISC-V with varying levels of hardware-support show that for data-widths common to neural network applications, the bit-sliced code produces a speedup over traditional methods, which leads to faster and efficient inference without incurring significant loss in accuracy. For example, 8-bit matrix multiplications are sped up by a factor of 2.62× when compared with non-bitsliced rv32i ISA implementation with no hardware multiplier.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2020 IEEE High Performance Extreme Computing Conference (HPEC)

自引率

0.00%

发文量